Samples of Comparing Files
【Abstract】
This article illustrates the comparison of two text files to find their commons and differences and offers sample programs in esProc SPL. Samples of Comparing Files !
Comparing two text files to find their common or different data is not a rare thing in data analysis work. According to the content to be compared, there are row-wise comparison and comparison by key field(s); and according to the sizes of files under comparison, there are small files comparison and big files comparison. This article illustrates the ways of handling these scenarios and provides sample programs in esProc SPL. esProc is the professional data computing engine based on SPL (Structured Process Language). The language has all-round functions libraries for performing set-based operations. It’s convenient and simple to code a file comparison task in esProc SPL.
1. Small file comparison
1.1 Row-wise comparison
There are two text files. Each row of both files is a string. We want to compare them row by row. To do this, we read in each row of every file as a string to form a set of strings and then perform a set operation.
paint.txt and dance.txt record IDs and names of children who enrolled in painting class and dancing class respectively. Below is part of paint.txt:
20121102-Joan
20121107-Jack
20121113-Mike
1.1.1. Find common data
To find common records of the two files is to get their intersection.
Example: Find all children who enrolled in both painting class and dancing class and write their records to p_d.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/paint.txt").read@n() |
Read each of the rows from paint.txt to form a set |
2 |
=file("e:/txt/dance.txt").read@n() |
Read each of the rows from dance.txt to form a set |
3 |
=file("e:/txt/p_d.txt").write(A1^A2) |
Write their intersection to p_d.txt |
1.1.2 Find differences
There are two types of scenarios of getting differences:
1. Find all different records of the two files.
Example: Find the children who enrolled in either painting class or dancing class.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/paint.txt").read@n() |
Read each of the rows from paint.txt to form a set |
2 |
=file("e:/txt/dance.txt").read@n() |
Read each of the rows from dance.txt to form a set |
3 |
=file("e:/txt/p_d.txt").write(A1%A2) |
Write their XORs to p_d.txt |
2. Find records from each file that doesn’t exist in the other one.
Example: Find children who enrolled in painting class or dancing class only.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/paint.txt").read@n() |
Read each of the rows from paint.txt to form a set |
2 |
=file("e:/txt/dance.txt").read@n() |
Read each of the rows from dance.txt to form a set |
3 |
=file("e:/txt/p_1.txt").write(A1\A2) |
Remove records children from A1 that also exist in A2 to get children who only enrolled in painting class and write them to p_1.txt |
4 |
=file("e:/txt/d_1.txt").write(A2\A1) |
Remove records children from A2 that also exist in A1 to get children who only enrolled in dancing class and write them to d_1.txt |
1.2 Comparison by the key
There are two text files. Each has a number of columns, contains column headers on the first row, and begins detailed data from the second row. To compare key values of the two files, we read in each as a data set, get values of each file’s key to form a set and perform a set operation.
Below are order_2018.txt and order_2019.txt that record orders records of 2018 and 2019 respectively. They have same column structure. Below is part of order_2018.txt:
1.2.1. Find common data
Find records whose key values are equal.
Example: Find customers who bought same products in the two years and write CustomerId and ProductId to c_p.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_2018.txt").import@t(CustomerId,ProductId) |
Read value of the key fields in the orders table of 2018 |
2 |
=file("e:/txt/order_2019.txt").import@t(CustomerId,ProductId) |
Read value of the key fields in the orders table of 2019 |
3 |
=[A1,A2].merge@io() |
Merge A1 and A2 to return their common values using @i option |
4 |
=file("e:/txt/c_p.txt").export@t(A3) |
Write A3’s result to c_p.txt |
Below is part of c_p.txt:
CustomerId ProductId
20108 1
20806 3
1.2.2 Find differences
Example 1: Find records of new customers in 2019 and write them to new_c.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_2018.txt").import@t() |
Import orders table of 2018 |
2 |
=file("e:/txt/order_2019.txt").import@t() |
Import orders table of 2019 |
3 |
=A2.id(CustomerId)\A1.id(CustomerId) |
Remove customer IDs of 2019 from those of 2018 to get the new customer IDs |
4 |
=A2.select(A3.contain(CustomerId)) |
Find records corresponding the new customer IDs in orders table of 2019 |
5 |
=file("e:/txt/new_c.txt").export@t(A4) |
Export records of the new customers to new_c.txt |
Below is part of new_c.txt:
Example 2: Find records of lost customers in 2019 and write them to lost_c.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_2018.txt").import@t() |
Import orders table of 2018 |
2 |
=file("e:/txt/order_2019.txt").import@t() |
Import orders table of 2019 |
3 |
=A1.id(CustomerId)\A2.id(CustomerId) |
Remove customer IDs of 2018 from those of 2019 to get the lost customer IDs |
4 |
=file("e:/txt/lost_c.txt").write(A3) |
Write records of lost customers to Remove customer IDs of 2018 from those of 2019 to get the desired new customer IDs |
2. Big files
Since big files cannot fit into the memory, we need to read data and do the comparison in batches. esProc SPL is equipped with the cursor mechanism to handle the big file processing in a convenient way.
2.1 Row-wise comparison
Here are two big text files. Each row is a string. To do the row-wise comparison, we read in every row of each file as a string with the cursor and then perform set-operations over the cursors.
There is a big file that records information of real property owners in each state, including their IDs and names. Below is part of the file:
510121198802213364-Joan
110113199203259852-Jack
201264197206271113-Mike
2.1.1 Find common data
To find common records of the two files is to get their intersection.
Example: Find records of people who possess real property in both Washington state and New York state and write them to w_n.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/washington.txt").cursor@s().sortx(_1) |
Create cursor for washington.txt and sort it |
2 |
=file("e:/txt/newyork.txt").cursor@s().sortx(_1) |
Create cursor for newyork.txt and sort it |
3 |
=[A1,A2].mergex@i() |
Perform order-based merge of the two cursors; @i option enables getting the intersection, the people who have real property in both states |
4 |
=file("e:/txt/w_n.txt").export(A3) |
Export the result to w_n.txt |
【Note】 If the records in source files are already ordered, sortx() function isn’t needed (The same in the following examples)
2.1.2 Find differences
Example: We have all.txt that records information of real property owners in the whole country, and washington.txt that stores information of real property owners in Washington state only. Now we want to find the owners in Washington state who are not in the nationwide table and write them to lost_w.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/washington.txt").cursor@s().sortx(_1) |
Create cursor for washington.txt and sort it |
2 |
=file("e:/txt/all.txt").cursor@s().sortx(_1) |
Create cursor for all.txt and sort it |
3 |
=[A1,A2].mergex@d() |
Perform order-based merge of the two cursors; @d option enables deletion of A2’s owners from A1 to get those who are not recorded in the nationwide table |
4 |
=file("e:/txt/lost_w.txt").export(A3) |
Export result to lost_w.txt |
2.2 Comparison by the key
To compare big files we need to compare their cursors. First we sort each file by the key and then perform order-based merge to get the desired result.
To make the illustrations simpler, we still use the orders tables (order_2018.txt and order_2019.txt) used in handling small files but increase their sizes.
2.2.1 Find common data
Find records of the two files that have same key values.
Example: Find customers who bought same products in the two years and write CustomerId and ProductId to c_p.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_2018.txt").cursor@t(CustomerId,ProductId) |
Create cursor of the key fields in orders table of 2018 |
2 |
=file("e:/txt/order_2019.txt").cursor@t(CustomerId,ProductId) |
Create cursor of the key fields in orders table of 2019 |
3 |
=A1.sortx(CustomerId,ProductId) |
Sort orders of 2018 by the key fields |
4 |
=A2.sortx(CustomerId,ProductId) |
Sort orders of 2019 by the key fields |
5 |
=[A3,A4].merge@i() |
Merge cursors of A3 and A4; @i enables getting their common records |
6 |
=file("e:/txt/c_p.txt").export@t(A5) |
Write A5’s result to c_p.txt |
2.2.2 Find differences
Example 1: Find records of new customers in 2019 and write them to new_c.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_2018.txt").cursor@t() |
Create cursor of orders table of 2018 |
2 |
=file("e:/txt/order_2019.txt").cursor@t() |
Create cursor of orders table of 2019 |
3 |
=A1.groupx(CustomerId) |
Group A1 by CustomerId to get all unique customer IDs of 2018 |
4 |
=A2.groupx(CustomerId) |
Group A2 by CustomerId to get all unique customer IDs of 2019 |
5 |
=[A4,A3].mergex@d() |
Remove all IDs of 2018 from those of 2019 to get the new customers in 2019 |
6 |
=A2.reset().sortx(CustomerId) |
Reset the cursor of orders table of 2019 and sort it by CustomerId |
7 |
=joinx(A6:new_order,CustomerId;A5:new_c,CustomerId) |
Join orders table of 2019 with the new customer IDs to get orders of the new customers in 2019 |
8 |
=file("e:/txt/new_c.txt").export@t(A7.(new_order)) |
Write the records of new customers to new_c.txt |
Note: The result of executing groupx() function is already ordered by the grouping field CustomerId.
Example 2: Find records of lost customers in 2019 and write them to lost_c.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_2018.txt").cursor@t() |
Create cursor of orders table of 2018 |
2 |
=file("e:/txt/order_2019.txt").cursor@t() |
Create cursor of orders table of 2019 |
3 |
=A1.groupx(CustomerId) |
Group A1 by CustomerId to get all unique customer IDs of 2018 |
4 |
=A2.groupx(CustomerId) |
Group A2 by CustomerId to get all unique customer IDs of 2019 |
5 |
=[A3,A4].mergex@d() |
Id Remove all IDs of 2019 from those of 2018 to get the lost customers |
6 |
=file("e:/txt/lost_c.txt").export@t(A5) |
Write the records of lost customers to lost_c.txt |
3. Comparison of big file and small file
To compare a big file and a small file, we can import the small file into the memory and retrieve data from the big file with the cursor and then perform a join between the database table and the cursor.
3.1 Row-wise comparison
Still use the real property information as the example. The big file records information of real property owners in all states. The smaller file records that of the real property owners in a city.
3.1.1 Find common data
To find common records of the two files is to get their intersection.
Example: Find records of people who possess real property in both Washington state and New York state and write them to w_n.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/washington.txt").cursor@s() |
Create cursor of washington.txt; no need to sort it |
2 |
=file("e:/txt/newyorkcity.txt").read@n() |
Read records of New York city; @n enables reading whole rows to form a sequence |
3 |
=A1.join@i(_1,A2) |
Join A1’s cursor with A2’s sequence; @i enables retaining records in A1 that match A2’s records only |
4 |
=file("e:/txt/w_n.txt").export(A3) |
Export result to w_n.txt |
3.1.2 Find differences
Example: Find the owners in New York city who are not in the New York state table and write them to lost_w.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/newyork.txt").cursor@s() |
Create cursor of washington.txt; no need to sort it |
2 |
=file("e:/txt/newyorkcity.txt").read@n() |
Read records of New York city; @n enables reading whole rows to form a sequence |
3 |
=A1.join@i(_1,A2) |
Join A1’s cursor with A2’s sequence; @i enables retaining records in A1 that match A2’s records only |
4 |
=A2\A3.(_1) |
Remove their common records from A2 to get the lost records |
5 |
=file("e:/txt/lost_w.txt").export(A4) |
Export result to lost_w.txt |
3.2 Comparison by the key
order_2019.txt records orders of 2019 and it is small. order_old.txt stores orders before 2019 and it is big.
3.2.1 Find common data
Example: Find customers who bought same products in the year of 2019 and before and write CustomerId and ProductId to c_p.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_old.txt").cursor@t(CustomerId,ProductId) |
Create cursor for the key fields of order_old.txt |
2 |
=file("e:/txt/order_2019.txt").import@t(CustomerId,ProductId) |
Import values of key fields of order_2019.txt |
3 |
=A2.group@1(CustomerId,ProductId) |
Remove duplicates from the orders of 2019 |
4 |
=A1.join@i(CustomerId:ProductId,A3:CustomerId:ProductId) |
Join A1’s cursor with A3’s sequence by the key fields; @i enables retaining matching records only |
5 |
=A4.fetch().group@1(CustomerId,ProductId) |
Get records from the joining result and delete deuplicates |
6 |
=file("e:/txt/c_p.txt").export@t(A5) |
Write A5’s result to c_p.txt |
3.2.2 Find differences
Example 1: Find records of new customers in 2019 and write them to new_c.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_old.txt").cursor@t() |
Create cursor for order_old.txt |
2 |
=file("e:/txt/order_2019.txt").import@t() |
Import data of order_2019.txt |
3 |
=A1.groupx(CustomerId) |
Group order_old.txt to get all customer IDs |
4 |
=A2.id(CustomerId) |
Remove records having duplicate IDs from A2 |
5 |
=A3.join@i(CustomerId,A4).fetch() |
Join A3 and A4; @i enables retaining customers that are included in the orders table of 2019 |
6 |
=A4\A5.(CustomerId) |
Get new customers by removing A5’s customer IDs from the customer IDs of 2019 |
7 |
=A2.select(A6.contain(CustomerId)) |
Get records of 2019 containing the new customer IDs |
8 |
=file("e:/txt/new_c.txt").export@t(A7) |
Export new orders to new_c.txt |
Example 2: Find records of lost customers in 2019 and write them to lost_c.txt.
esProc SPL script:
A |
Comment |
|
1 |
=file("e:/txt/order_old.txt").cursor@t(CustomerId) |
Create cursor for customer IDs in order_old.txt |
2 |
=file("e:/txt/order_2019.txt").import@t(CustomerId) |
Get customer IDs from orders of 2019 |
3 |
=A2.id(CustomerId) |
Get unique customer IDs of 2019 |
4 |
=A1.select(!A3.contain(CustomerId)) |
Get customer IDs from order_old.txt that are not included in customers of 2019 |
5 |
=A4.groupx(CustomerId) |
Delete duplicate customer IDs from A4 |
6 |
=file("e:/txt/lost_c.txt").export@t(A5) |
Write records of lost customers to lost_c.txt |
Find more examples in SPL CookBook.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version