"The SQL SERVER database table stores personnel records from multiple sources. If one of the Name .."

jinxing RaqForum 41 No.
88 View • 2 Months ago

SQL: Multi field mixed deduplication followed by numbering

The SQL SERVER database table stores personnel records from multiple sources. If one of the Name, Phone, or Email fields in two records is duplicated, it indicates that the two records belong to the same person. Null means that the data is unknown. When both fields of two records are null, it means that they are not duplicated by default. Whether they are duplicated depends on other fields to determine. Note that if A and B are the same, and B and C are the same, then A and C are also the same.

RegisterId	Name	Phone	Email
XXX-00001	John Strauss	241567	Null
XXX-00023	Rick Astley	241567	richardastley@gmail.com
XXX-00003	John Strauss	NULL	NULL
XXX-00099	NULL	241567	georgeharrison@gmail.com
XXX-00085	NULL	256819	richardastley@gmail.com
XXX-00016	NULL	NULL	georgeharrison@gmail.com
XXX-00007	John Deep	280933	NULL
XXX-00008	John Deep	93484	NULL
XXX-00009	Javier Estrada	94578	javier@gmail.com
XXX-00010	NULL	39939	39939@gmail.com

Requirement: Add personnel number 'no' as a calculation column, find duplicate records, and assign an independent 'no' to each set of duplicate records.

RegisterId	Name	Phone	Email	no
XXX-00001	John Strauss	241567		1
XXX-00023	Rick Astley	241567	richardastley@gmail.com	1
XXX-00003	John Strauss			1
XXX-00099		241567	georgeharrison@gmail.com	1
XXX-00085		256819	richardastley@gmail.com	1
XXX-00016			georgeharrison@gmail.com	1
XXX-00007	John Deep	280933		7
XXX-00008	John Deep	93484		7
XXX-00009	Javier Estrada	94578	javier@gmail.com	9
XXX-00010		39939	39939@gmail.com	10

SPL code


1	=mssql.query("select * from tb")
2	=T=A1.derive(#:no)
3	for T.count(T[1:].count( if(no!=T.no && ( (Name && T.Name && Name==T.Name) \|\| (Phone && T.Phone && Phone==T.Phone) \|\| (Email && T.Email && Email==T.Email)), T.no=no=min(no,T.no) ))>0)>0
4	return T

A1: Query the database through JDBC.

A2: Add a new number column 'no', which defaults to record number #.

A3: Use an infinite loop to traverse records and adjust the ‘no’ column. If there is a situation where ‘no’ is adjusted after a certain traversal, traverse again until all ‘no’ are no longer adjusted. Traverse records from top to bottom. When the current record is i-th, compare it with the (i+1) th till the last record in sequence. If it is considered a duplicate record, synchronize the 'no' of both records and take the smaller one between them. When the field value is null, it defaults to not being duplicated with other records. Note that null is false when performing logical AND operations on any value.

The count function returns the number of members that meet the criteria. During the comparison process between the i-th record and the i+1 to the last record, if there is an action where ‘no’ is adjusted, the count of the inner layer will be greater than 0, which will cause the count of the outer layer to also be greater than 0, satisfying the condition for continuing the loop.

We still need the condition of ‘no!=T.no’ to ensure that at least one 'no' value in the adjustment action becomes smaller, otherwise an endless loop may occur.

Problem source: https://stackoverflow.com/questions/78410258/group-rows-which-share-some-value-in-three-columns-and-assign-them-a-same-id

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

Application