SQL: Multi field mixed deduplication followed by numbering

The SQL SERVER database table stores personnel records from multiple sources. If one of the Name, Phone, or Email fields in two records is duplicated, it indicates that the two records belong to the same person. Null means that the data is unknown. When both fields of two records are null, it means that they are not duplicated by default. Whether they are duplicated depends on other fields to determine. Note that if A and B are the same, and B and C are the same, then A and C are also the same.

RegisterId

Name

Phone

Email

XXX-00001

John Strauss

241567

Null

XXX-00023

Rick Astley

241567

richardastley@gmail.com

XXX-00003

John Strauss

NULL

NULL

XXX-00099

NULL

241567

georgeharrison@gmail.com

XXX-00085

NULL

256819

richardastley@gmail.com

XXX-00016

NULL

NULL

georgeharrison@gmail.com

XXX-00007

John Deep

280933

NULL

XXX-00008

John Deep

93484

NULL

XXX-00009

Javier Estrada

94578

javier@gmail.com

XXX-00010

NULL

39939

39939@gmail.com

Requirement: Add personnel number 'no' as a calculation column, find duplicate records, and assign an independent 'no' to each set of duplicate records.

RegisterId

Name

Phone

Email

no

XXX-00001

John Strauss

241567


1

XXX-00023

Rick Astley

241567

richardastley@gmail.com

1

XXX-00003

John Strauss



1

XXX-00099


241567

georgeharrison@gmail.com

1

XXX-00085


256819

richardastley@gmail.com

1

XXX-00016



georgeharrison@gmail.com

1

XXX-00007

John Deep

280933


7

XXX-00008

John Deep

93484


7

XXX-00009

Javier Estrada

94578

javier@gmail.com

9

XXX-00010


39939

39939@gmail.com

10

SPL code



1

=mssql.query("select * from tb")

2

=T=A1.derive(#:no)

3

for T.count(T[1:].count( if(no!=T.no && (

(Name && T.Name && Name==T.Name) ||

(Phone && T.Phone && Phone==T.Phone) ||

(Email && T.Email && Email==T.Email)),

T.no=no=min(no,T.no) ))>0)>0

4

return T

A1: Query the database through JDBC.

A2: Add a new number column 'no', which defaults to record number #.

A3: Use an infinite loop to traverse records and adjust the ‘no’ column. If there is a situation where ‘no’ is adjusted after a certain traversal, traverse again until all ‘no’ are no longer adjusted. Traverse records from top to bottom. When the current record is i-th, compare it with the (i+1) th till the last record in sequence. If it is considered a duplicate record, synchronize the 'no' of both records and take the smaller one between them. When the field value is null, it defaults to not being duplicated with other records. Note that null is false when performing logical AND operations on any value.

The count function returns the number of members that meet the criteria. During the comparison process between the i-th record and the i+1 to the last record, if there is an action where ‘no’ is adjusted, the count of the inner layer will be greater than 0, which will cause the count of the outer layer to also be greater than 0, satisfying the condition for continuing the loop.

We still need the condition of ‘no!=T.no’ to ensure that at least one 'no' value in the adjustment action becomes smaller, otherwise an endless loop may occur.

Problem source: https://stackoverflow.com/questions/78410258/group-rows-which-share-some-value-in-three-columns-and-assign-them-a-same-id