SQL: Multi field mixed deduplication followed by numbering
The SQL SERVER database table stores personnel records from multiple sources. If one of the Name, Phone, or Email fields in two records is duplicated, it indicates that the two records belong to the same person. Null means that the data is unknown. When both fields of two records are null, it means that they are not duplicated by default. Whether they are duplicated depends on other fields to determine. Note that if A and B are the same, and B and C are the same, then A and C are also the same.
RegisterId |
Name |
Phone |
|
XXX-00001 |
John Strauss |
241567 |
Null |
XXX-00023 |
Rick Astley |
241567 |
richardastley@gmail.com |
XXX-00003 |
John Strauss |
NULL |
NULL |
XXX-00099 |
NULL |
241567 |
georgeharrison@gmail.com |
XXX-00085 |
NULL |
256819 |
richardastley@gmail.com |
XXX-00016 |
NULL |
NULL |
georgeharrison@gmail.com |
XXX-00007 |
John Deep |
280933 |
NULL |
XXX-00008 |
John Deep |
93484 |
NULL |
XXX-00009 |
Javier Estrada |
94578 |
javier@gmail.com |
XXX-00010 |
NULL |
39939 |
39939@gmail.com |
Requirement: Add personnel number 'no' as a calculation column, find duplicate records, and assign an independent 'no' to each set of duplicate records.
RegisterId |
Name |
Phone |
no |
|
XXX-00001 |
John Strauss |
241567 |
1 |
|
XXX-00023 |
Rick Astley |
241567 |
richardastley@gmail.com |
1 |
XXX-00003 |
John Strauss |
1 |
||
XXX-00099 |
241567 |
georgeharrison@gmail.com |
1 |
|
XXX-00085 |
256819 |
richardastley@gmail.com |
1 |
|
XXX-00016 |
georgeharrison@gmail.com |
1 |
||
XXX-00007 |
John Deep |
280933 |
7 |
|
XXX-00008 |
John Deep |
93484 |
7 |
|
XXX-00009 |
Javier Estrada |
94578 |
javier@gmail.com |
9 |
XXX-00010 |
39939 |
39939@gmail.com |
10 |
SPL code
1 |
=mssql.query("select * from tb") |
2 |
=T=A1.derive(#:no) |
3 |
for T.count(T[1:].count( if(no!=T.no && ( (Name && T.Name && Name==T.Name) || (Phone && T.Phone && Phone==T.Phone) || (Email && T.Email && Email==T.Email)), T.no=no=min(no,T.no) ))>0)>0 |
4 |
return T |
A1: Query the database through JDBC.
A2: Add a new number column 'no', which defaults to record number #.
A3: Use an infinite loop to traverse records and adjust the ‘no’ column. If there is a situation where ‘no’ is adjusted after a certain traversal, traverse again until all ‘no’ are no longer adjusted. Traverse records from top to bottom. When the current record is i-th, compare it with the (i+1) th till the last record in sequence. If it is considered a duplicate record, synchronize the 'no' of both records and take the smaller one between them. When the field value is null, it defaults to not being duplicated with other records. Note that null is false when performing logical AND operations on any value.
The count function returns the number of members that meet the criteria. During the comparison process between the i-th record and the i+1 to the last record, if there is an action where ‘no’ is adjusted, the count of the inner layer will be greater than 0, which will cause the count of the outer layer to also be greater than 0, satisfying the condition for continuing the loop.
We still need the condition of ‘no!=T.no’ to ensure that at least one 'no' value in the adjustment action becomes smaller, otherwise an endless loop may occur.
Problem source: https://stackoverflow.com/questions/78410258/group-rows-which-share-some-value-in-three-columns-and-assign-them-a-same-id
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL