Grouping Duplicates in CSV file and Ranking Data Based on Certain Values
Question
I have a CSV file as follows:
"user_id","age","liked_ad","location"
2145,34,true,USA
6786,25,true,UK
9025,21,false,USA
1145,40,false,UK
The file goes on. There are duplicate user_id values within the file, so what I am trying to do is to find out which users have the most 'true' answers for the 'liked_ad' column. I am super stuck on how to do this in Java and would appreciate any help.
This is what I have so far to literally just parse the file:
public static void main(String[] args) throws FileNotFoundException
{
Scanner scanner = new Scanner(new File("src/main/resources/advert-data.csv"));
scanner.useDelimiter(",");
while (scanner.hasNext()) {
System.out.print(scanner.next() + "|");
}
scanner.close();
}
I'm stuck on where to go from here in order to achieve what I am trying to achieve.
Answer
Retrieve the CSV file, group it by user_id, count records whose third column is true in each group, find groups where the count is greater than 0, and then sort records by the count in descending order. The code will be lengthy if you try to code the process in Java.
I suggest you using SPL, the open-source Java package to do this. It is simple and only one line of code is enough:
A |
|
1 |
=file("advert-data.csv").import@cqt().groups(user_id;count(#3==true):count).select(#2>0).sort(-#2) |
SPL offers JDBC driver to be invoked by Java. Just store the above SPL script as rank.splx and invoke it in Java as you call a stored procedure:
…
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
st = con.prepareCall("call rank()");
st.execute();
…
View SPL source code.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version