Group & Aggregate with SPL
【Question】
I have a CSV file with the following values:
#BOF
userID;gender;movieID;rating
1;m;100;50
1;m;101;100
1;m;102;0
2;f;100;100
2;f;101;80
3;m;104;70
4;m;104;80
5;f;100;75
#EOF
I want to know how many movies each user has rated. Assume that there are hundreds of thousands of users. I tried to code it in Java (using Eclipse) with:
while ((strLine = br.readLine()) != null) {
String[] strings = strLine.split(";");
But then I got stuck. I am new to this, so it probably looks easy, but not to me yet.
【Answer】
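Coding group & aggregate operations in Java is inconvenient: the language offers no single built-in function that loads a delimited text file as a table and groups it, so the reading, cleaning, and parsing must all be coded by hand. Here I get it done with SPL (Structured Process Language):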
|   | A |
| 1 | =file("d:\\source.csv").read@n() |
| 2 | =A1.to(2,A1.len()-1) |
| 3 | =A2.concat("\n") |
| 4 | =A3.import@t(;";") |
| 5 | =A4.groups(userID;count(movieID)) |
A1: Read in the contents of source.csv and return the lines as a sequence of strings; each line is a member.
A2: Take the members of A1’s sequence from the second to the second-to-last, which drops the #BOF and #EOF marker lines and keeps the header and data rows.
A3: Join members of A2’s sequence into a string with the delimiter “\n”.
A4: Parse A3’s string into records, using the first line as column names (the @t option) and the semicolon as the field delimiter, and return them as a table sequence.
A5: Group records by userID and count records in each group.
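For comparison, the same steps (drop the marker lines, skip the header, split on the semicolon, group by userID and count) can be sketched in plain Java with the Streams API. This is a minimal sketch rather than part of the SPL solution: the class name RatingCounter and the method countMoviesPerUser are hypothetical names chosen for illustration, and it assumes every data row is well formed with a numeric userID in the first field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class RatingCounter {

    // Hypothetical helper: counts rated movies per userID from the raw file lines.
    static Map<Integer, Long> countMoviesPerUser(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                // Drop blank lines and the #BOF / #EOF marker lines.
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                // Skip the header row (userID;gender;movieID;rating).
                .skip(1)
                .map(line -> line.split(";"))
                // Group by the first field (userID) and count the rows in each group;
                // a TreeMap keeps the result sorted by userID.
                .collect(Collectors.groupingBy(
                        fields -> Integer.valueOf(fields[0]),
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // The sample data from the question, inlined so the sketch runs on its own.
        List<String> lines = Arrays.asList(
                "#BOF",
                "userID;gender;movieID;rating",
                "1;m;100;50", "1;m;101;100", "1;m;102;0",
                "2;f;100;100", "2;f;101;80",
                "3;m;104;70", "4;m;104;80", "5;f;100;75",
                "#EOF");
        countMoviesPerUser(lines)
                .forEach((userId, count) -> System.out.println(userId + "\t" + count));
    }
}
```

On the sample data this prints 3 for user 1, 2 for user 2, and 1 each for users 3, 4, and 5. To process a real file, the inlined list could be replaced with the lines read from disk, for example through java.nio.file.Files.lines, which streams the file rather than loading it whole.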
The SPL script can be easily integrated into a Java application. See How to Call an SPL Script in Java for more details.