How to Parse a Large Text File Fast

Question

What’s the most efficient way to parse a large text file?

I’m trying to optimize some code. I have to open a large text file, match each line against a regular expression, and then process the results.

I’ve tried the simple approaches:

 

for line in my_file:
    match = my_regx.match(line)
    process(match.groups())

 

and 

 

data = my_file.read().splitlines()
for line in data:
    # etc.

Neither is terribly speedy. Does anyone have a better method?

 

Answer

Multithreaded parallel processing can make the matching faster, but writing parallel code in Python is complicated, and splitting the file into byte segments is a headache of its own. SPL (Structured Process Language) makes all of this much easier: it divides file1.txt into multiple segments, gives each thread one segment to match against the regular expression, collects the qualifying rows, and concatenates them for export to a text file. Below is the SPL script (a rough Python sketch of the same chunked approach follows it for comparison):

        A
1       =file("D:\\file1.txt")
2       =A1.cursor@m(;4).(~.array().concat())
3       =A2.regex(".*smile.*")
4       =file("D:\\result.txt").export(A3)

A regular expression is versatile but not especially fast. If the matching rule is simple, the like function can speed up the process (a plain-Python equivalent follows the script). For example:

        A
1       =file("D:\\file1.txt")
2       =A1.cursor@m(;4)
3       =A2.select(like(#1,"*smile*"))
4       =file("D:\\result.txt").export(A3)

esProc SPL comes with a rich library of functions for implementing a wide range of algorithms, including grouping and aggregation, ranking and sorting, association operations, multi-file queries, merge queries, and more.