How to Compare Two Large CSV Files in Java
Question
Source: https://stackoverflow.com/questions/69357566/how-to-compare-two-large-csv-file-in-java
I need to compare two large csv files and find differences.
The first CSV file looks like this:
c71f55b6c18248b8915d8a26
64b7d2d4eab74d7999a967c0
ceb792ad21054fe0a27ec410
95319566f9424c57ba2145f9
682a4fe26c154050b8f5c6f1
88e0209e2af74049ad9bf2bd
5c462b42763d41d7bb67029f
0ee74c227fc84e39a9ecc1da
66f7ab6f56374ba08d2fb92d
3ed793e35f9441b58562c9ba
baad81ac8ba54188afe63fb8
...
Each row contains a single ID, and the file has approximately 5 million rows. The second CSV file has the same layout, with about 3 million rows.
I need to remove the IDs that appear in the second CSV from the first CSV and store the result in MongoDB. When I load all lines into memory and then compare the two files, I get an out-of-memory error. I have 512 MB of memory available and will receive at least 30 requests a day. The row count of each CSV varies between 1 million and 10 million. Two requests can arrive at the same time and must be processed concurrently.
Is there another way to do this?
Thanks.
Answer
You need to delete from the first CSV file the rows that also exist in the second CSV file. Both files are too large to be loaded entirely into memory, so doing this in plain Java takes a fairly long piece of code.
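To give a sense of what the plain-Java route involves, here is a minimal sketch of an external sort: the file is cut into chunks that fit in memory, each chunk is sorted and spilled to a temporary file, and the sorted chunks are then merged. The class name `ExternalSortSketch`, the method `sort`, and the `maxLinesInMemory` parameter are hypothetical names chosen for illustration; the sketch assumes one ID per line and UTF-8 text, and it is not tuned for any particular memory budget.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of SPL or any library.
final class ExternalSortSketch {

    /** One open sorted chunk file plus the line it is currently positioned on. */
    private static final class Run {
        final BufferedReader reader;
        String current;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.current = reader.readLine();
        }

        void advance() throws IOException {
            current = reader.readLine();
        }
    }

    /** Sorts the lines of input into output, buffering at most maxLinesInMemory lines per chunk. */
    static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<Run> opened = new ArrayList<>();
        try {
            // Phase 1: read chunks that fit in memory, sort each one, spill it to a temp file.
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> chunk = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() >= maxLinesInMemory) {
                        runs.add(spill(chunk));
                        chunk.clear();
                    }
                }
                if (!chunk.isEmpty()) {
                    runs.add(spill(chunk));
                }
            }
            // Phase 2: k-way merge; the heap holds only the current line of each chunk.
            PriorityQueue<Run> heap =
                    new PriorityQueue<>(Comparator.comparing((Run r) -> r.current));
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                for (Path p : runs) {
                    Run r = new Run(Files.newBufferedReader(p, StandardCharsets.UTF_8));
                    opened.add(r);
                    if (r.current != null) {
                        heap.add(r);
                    }
                }
                while (!heap.isEmpty()) {
                    Run r = heap.poll();
                    out.write(r.current);
                    out.newLine();
                    r.advance();
                    if (r.current != null) {
                        heap.add(r);
                    }
                }
            }
        } finally {
            // Close chunk readers and remove the temporary files.
            for (Run r : opened) {
                r.reader.close();
            }
            for (Path p : runs) {
                Files.deleteIfExists(p);
            }
        }
    }

    /** Sorts one in-memory chunk and writes it to a new temporary file. */
    private static Path spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path tmp = Files.createTempFile("run-", ".txt");
        Files.write(tmp, chunk, StandardCharsets.UTF_8);
        return tmp;
    }
}
```

Memory use is bounded by the chunk size rather than the file size, which is what makes this approach workable when the whole file cannot be held at once. Because two requests may run concurrently, the chunk size would have to be chosen so that both sorts fit in the available heap together.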
It is much simpler to do this in SPL, an open-source Java package. One line of code is sufficient:
|   | A |
|---|---|
| 1 | =file("result.csv").export([file("csv1.csv").cursor@i().sortx(~),file("csv2.csv").cursor@i().sortx(~)].mergex@d()) |
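As we read it, the SPL expression sorts each file and then merges the two sorted streams, keeping only the rows of the first that are absent from the second. For readers who want to see that merge step spelled out, the following is a rough Java sketch of a streaming difference over two files that are already sorted. The class `SortedDiffSketch` and its method `difference` are hypothetical names; the sketch assumes both inputs are sorted in `String.compareTo` order with one ID per line, and it makes no claim to match SPL's behavior on duplicates or edge cases.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of SPL or any library.
final class SortedDiffSketch {

    /**
     * Writes every line of sortedA that does not occur in sortedB to result.
     * Both inputs must already be sorted in String.compareTo order.
     * Returns the number of lines written.
     */
    static long difference(Path sortedA, Path sortedB, Path result) throws IOException {
        long written = 0;
        try (BufferedReader a = Files.newBufferedReader(sortedA, StandardCharsets.UTF_8);
             BufferedReader b = Files.newBufferedReader(sortedB, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(result, StandardCharsets.UTF_8)) {
            String x = a.readLine();
            String y = b.readLine();
            while (x != null) {
                // Once B is exhausted, everything left in A is kept.
                int cmp = (y == null) ? -1 : x.compareTo(y);
                if (cmp < 0) {
                    // x is smaller than anything left in B, so it is not in B.
                    out.write(x);
                    out.newLine();
                    written++;
                    x = a.readLine();
                } else if (cmp == 0) {
                    // x is in B: drop it. y is kept so repeated copies of x are dropped too.
                    x = a.readLine();
                } else {
                    // y is behind x: advance B.
                    y = b.readLine();
                }
            }
        }
        return written;
    }
}
```

Only one line from each file is held in memory at a time, so the memory footprint stays constant regardless of how many rows the files contain. The output could then be read back in batches for insertion into the database.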
SPL provides a JDBC driver that Java can invoke. Save the SPL script above as diff.splx and call it from Java the same way you would call a stored procedure:
…
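// Load the esProc JDBC driver and open a local connection
Class.forName("com.esproc.jdbc.InternalDriver");
con = DriverManager.getConnection("jdbc:esproc:local://");
// Call the saved script diff.splx by name, like a stored procedure
st = con.prepareCall("call diff()");
st.execute();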
…
Alternatively, execute the SPL code as a string inside the Java program, the same way you would execute a SQL statement:
…
st = con.prepareStatement("==file(\"result.csv\").export([file(\"csv1.csv\").cursor@i().sortx(~),file(\"csv2.csv\").cursor@i().sortx(~)].mergex@d())");
st.execute();
…
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL