Fast Read & Group over a CSV in SPL
【Question】
I am writing a script in Perl, but got stuck in one part. Below is a sample of my CSV file.
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G",
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G",
"MP","918120197922","20150806125004","prepaid","prepaid","2G",
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","prepaid","prepaid","2G","3G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
Now I need to take unique records on the basis of 2nd column (i.e. mobile numbers) but considering only the latest value of 3rd column (i.e. timestamp). Eg: for mobile number "918120197922":
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
It should have selected the 3rd record as it has the latest value of timestamp (20150806125005). Please help.
Additional Info: Sorry for inconsistency in data. I have rectified it now. Yes data is in order, which means latest timestamp will appear in the latest rows. One more thing, my file has the size of more than 1 GB so is there any way to do this efficiently? Will Awk work faster than Perl in this case?
【Answer】
The logic is simple, that is, to find the record having the latest value of timestamp (maximum value). Sine text parsing is slow you’d better do it with multiple threads. Besides, a hash method will make the grouping faster than a traverse of all records. It’s complicated to do this in Perl. So I choose to handle your question in SPL. The Structured Process Language generates simple and easy code:
A |
|
1 |
=file("d:\\source.csv").cursor@qmc() |
2 |
=A1.groups(#2;top(-1;#3):a) |
3 |
=A2.(a).conj() |
4 |
=file("d:\\result.csv").export@c(A3) |
A1: Read in content of the source.csv without quotations and return a multicursor.
A2: Group each subcursor of A1’s multicursor by the 2nd column and then get the record from each group containing the maximum value in the 3rd colulmn.
A3: Concatenate the selected records from all groups.
A4: Export A3’s result set to the target file result.csv.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL