"Here are two csv files (A and B) of same structure. Both use KEY_A, KEY_B and KEY_C as the prima .."

jinxing RaqForum 41 No.
150 View • 4 Months ago

Compare large csv files

Here are two csv files (A and B) of same structure. Both use KEY_A, KEY_B and KEY_C as the primary key, and the two have different records.

A.csv

B.csv

KEY_A,KEY_B,KEY_C,V1,V2

a0,b0,c0,v0,v0

a1,b1,c1,v1,v2

a2,b2,c2,v2,v200

a4,b4,c4,v400,v4

a3,b3,c3,v3,v3

m1,p1,q1,v1,v1

m1,p2,q2,v1,v1

m1,p1,q2,v1,v1

m1,p2,q1,v1,v1

KEY_A,KEY_B,KEY_C,V1,V2

a1,b1,c1,v1,v2

a2,b2,c2,v200,v200

a3,b3,c3,v300,v3

a4,b4,c4,v4,v400

a,b,c,v,v

m1,p1,q1,v1,v1

Use Java to compare the two files in three ways and write the results respectively to new csv files. 1. Find records where primary keys are equal and values of the other fields are not equal, output their primary keys and then the other fields of A and the other fields in B. Below is the expected result:

KEY_A,KEY_B,KEY_C,V1,V2,V1,V2

a2,b2,c2,v2,v200,v200,v200

a3,b3,c3,v3,v3,v300,v3

a4,b4,c4,v400,v4,v4,v400

2. Find difference of A and B, that is, records existing in A but not existing in B, according to the primary key. Below is the expected result:

KEY_A,KEY_B,KEY_C,V1,V2

a0,b0,c0,v0,v0

m1,p1,q2,v1,v1

m1,p2,q1,v1,v1

m1,p2,q2,v1,v1

3. Find difference of B and A according to the primary key. Below is the expected result:

KEY_A,KEY_B,KEY_C,V1,V2

a,b,c,v,v

Write SPL code to do the three comparisons. Below is for the 1^st comparison:

		A
1	=T@c("A.csv")	=T@c("B.csv")
2	=A1.sortx(KEY_A,KEY_B,KEY_C)	=B1.sortx(KEY_A,KEY_B,KEY_C)
3	=joinx(A2,KEY_A,KEY_B,KEY_C;B2,KEY_A,KEY_B,KEY_C)
4	=A3.select(#1.V1!=#2.V1 \|\| #1.V2!=#2.V2)
5	=A4.new(#1.KEY_A,#1.KEY_B,#1.KEY_C,#1.V1,#1.V2,#2.V1,#2.V2)
6	=T("r1.csv":A5)

For the 2^nd comparison (Code for the 3^rd comparison is similar and we just skip it):

		A
1	=T@c("A.csv")	=T@c("B.csv")
2	=A1.sortx(KEY_A,KEY_B,KEY_C)	=B1.sortx(KEY_A,KEY_B,KEY_C)
3	=[A2,B2].merge@d(KEY_A,KEY_B,KEY_C)
4	=T("r2.csv":A3)

T()function parses a csv file or write data to a csv file; @c option enables using the cursor to retrieve data from a file that cannot fit into the memory. sortx() function sorts data in a cursor. joinx()function performs a merge-join. merge() function merges records; @d option enables finding the diffefrence.

Read How to Call a SPL Script in Java to find how to integrate SPL into a Java application.
Source:https://stackoverflow.com/questions/73748916/iterating-massive-csvs-for-comparisons

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

Application