Binary Search over an Ordered File
【Question】
I am working with a relatively large text file (70GB uncompressed, 15GB zipped) which contains 3 columns. The lines of the file are of the following form:
x1 | y1 | a1
x1 | y2 | a2
x2 | y3 | a3
x3 | y4 | a4
The x and y are sequences of words that contain between 1 and 4 words. The strings in the first column are sorted and not unique. The strings in the second column are not unique and sorted for a same element in the first column.
There are about 700,000,000 lines in the uncompressed text file. What I want is, for a query of tuples (x, y), get value a in the third column. I need to access it in a time as short as possible.
I tried to create 2 Dictionaries (strings, List of integers). The first dictionary maps a string to the index of the lines that contain this string in the first column, and the same for the second dictionary and the second column. Then for a query (x, y) I can intersect these two lists and get the line that contains “x | y | a”. I can then use a dictionary that maps a line number to its offset in the file and use random access file to read the line.
The problem is that this requires way too much memory (maybe it’s also because I’m using Java!). I am looking for a solution that can query the text file very quickly, but doesn’t require more than 20 / 30 GB of ram.
I guess there are methods to do this kind of things but I’m not familiar with them. Is there any ideas?
【Answer】
Using binary search can locate a record in a sorted large file very quickly. It’s complicated to code the searching process in Java because this may involve handling half records. For example, for a file of 70G, half of it may fall in a byte that is in the middle of a record. You must take the second half of the record from the second part of the file in order to correctly locate the record. You can handle the half-record in SPL (Structured Process Language) in intuitive and easy to understand code:
A |
B |
|
1 |
=file("D:\\filebig.txt") |
=[argX,argY] |
2 |
=A1.iselect@t([B1],[#1,#2];;"|").fetch@x() |
In the above script, both argX and argY are input parameters representing the first fields. #1 and #2 represent the 1st field and the 2nd field of a record. iselect function is used to locate a record using binary search.
About file cursor, you can refer to Text Files.
The SPL script can be called from a Java application through JDBC. See How to Call an SPL Script in Java.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL