How to Parse a Large Text File Fast
【Question】
What’s the most efficient way to parse a large text file?
I’m trying to optimize some code. I have to open a large text file, match each line against a regular expression, and then process the results.
I’ve tried the simple approaches:
for line in my_file:
    match = my_regx.match(line)
    process(match.groups())
and
data = my_file.read().splitlines()
for line in data:
    # etc.
Neither is terribly speedy. Does anyone have a better method?
【Answer】
Multithreaded parallel processing can speed up the matching, but parallel code is complicated to write in Python, and segmenting a file by bytes is a headache. SPL (Structured Process Language) makes all of this much easier. The script below divides file1.txt into multiple segments, gives each thread a segment to match against the regular expression, collects the eligible rows, and concatenates them for export to a text file:
    A
1   =file("D:\\file1.txt")
2   =A1.cursor@m(;4).(~.array().concat())
3   =A2.regex(".*smile.*")
4   =file("D:\\result.txt").export(A3)
A regular expression is versatile but relatively slow. If the matching rule is simple, the like function can speed up the process. For example:
    A
1   =file("D:\\file1.txt")
2   =A1.cursor@m(;4)
3   =A2.select(like(#1,"*smile*"))
4   =file("D:\\result.txt").export(A3)
esProc SPL provides a rich library of functions for various algorithms, including grouping and aggregation, ranking and sorting, associated operations, multi-file queries, merge queries, etc.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL