Text File Grouping & Aggregation
【Question】
I have a file in the following format:
45 a
36 a
58 a
21 a
31 b
12 b
23 b
21 c
32 c
45 c
…
I need to first count lines containing value a, output the result to a target file and write values of the first field in these lines after that. Then I go on to handle lines containing value b and then, lines containing value c, in the same way.
Expected result:
a 4
45
36
58
21
b 3
31
12
23
c 3
21
32
45
…
What I did is to traverse the file to count lines respectively containing a, b, and c, hold the results in a temporary file, and then re-traverse the file and read the temporary file to write the desired values to target file in the specified format. I am wondering if there’s any simple method to do this, like traversing the source file once only.
【Answer】
It’s simple grouping & aggregation. High-level languages lack class libraries that handle SQL-like calculations. So it’s very inconvenient to write the code directly based on a text file. It’s easy to handle such a calculation in SPL (Structured Process Language), however.
If your file can fit into the memory, we just need to traverse it once. Below is the SPL script:
A |
B |
|
1 |
=file("E:\\source.txt") |
=file("E:\\result.txt") |
2 |
=A1.import().group(_2) |
|
3 |
for A2 |
>B1.write@a(A3._2+"\t"+string(A3.len())) |
4 |
>B1.export@a(A3,_1) |
A1,B1: Open the to-be-handled file and the target export file.
A2: Import source.txt and group and sort records by the second column.
A3-B4: Loop through each member of A2’s sequence (which is A3’s current value) to count its records, generate a new record to write it to result.txt, and export values of A3’s first column to result.txt.
If the file is too large to be loaded into memory, you need a temporary file to hold the intermediate result. SPL has the built-in mechanism to auto-generate a temporary file. The script for handling a large file is similar:
A |
B |
|
1 |
=file("E:\\source.txt") |
=file("E:\\result.txt") |
2 |
=A1.cursor().sortx(_2) |
|
3 |
for A2;_2 |
>B1.write@a(A3._2+"\t"+string(A3.len())) |
4 |
>B1.export@a(A3,_1) |
A1,B1: Open the to-be-handled file and the target export file.
A2: Import source.txt as a cursor and sort its records by the second column. sortx() function will generate a temporary file to hold the sorting result.
A3-B4: Loop through A2’s sequence to read in records having same second column value at a time, generate a new record to write it to result.txt, and export values of A3’s first column to result.txt.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL