An Easy Way of Handling Large Text Files with Parallel Processing
Keywords: large text file, parallel processing
Although the multicore CPUs in contemporary computers provide the hardware to speed up large-file processing through parallelism, writing a parallel program in a general-purpose programming language is not easy.
Parallel processing means dividing the source file so that each thread handles one part. In a text file, a line usually makes a record, but line lengths vary. Dividing by number of lines is therefore impractical: locating the n-th line requires a traversal from the beginning of the file, which can cancel out the performance gains. Dividing by bytes needs no traversal, but it raises another problem. The breaking point between two sections may fall in the middle of a line, so the line is split across sections and the data becomes inconsistent. A solution is a segmentation method that automatically restores the split line to the previous section. That is, the ending line of a section is retained whole, and the partial line at the beginning of the next section is given up. This ensures that each section covers only complete lines and that the data stays consistent.
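To make the segmentation rule concrete, here is a minimal Python sketch of byte-based division with line alignment. It is an illustration only, not SPL's internal implementation. The function name `read_segment` and its parameters are hypothetical, and the sketch assumes a UTF-8 file whose records end with a newline character.

```python
import os


def read_segment(path, part, parts):
    """Return the complete lines owned by segment `part` (0-based) of `parts`.

    A line is owned by the segment in which its first byte falls, so the
    last line of a segment is kept whole and the partial line at the start
    of the next segment is given up.
    """
    size = os.path.getsize(path)
    start = size * part // parts
    end = size * (part + 1) // parts
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read through the next newline.
            # If `start` is exactly a line start, this consumes only the
            # preceding newline; otherwise it skips the partial line,
            # which belongs to the previous segment.
            f.seek(start - 1)
            f.readline()
        # Read every line that begins before `end`, even if it ends after it.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\r\n").decode("utf-8"))
    return lines
```

Calling `read_segment(path, p, 4)` for `p` in `range(4)` should yield every line of the file exactly once, and no call needs to scan the file from the beginning.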
Thread control and management is also a problem. Mismanagement can easily lead to errors such as out-of-bounds access.
These division and thread management problems go away if we use esProc SPL to do the job. SPL (Structured Process Language) encapsulates the multithreaded algorithms, so the program is short and easy to understand. It delivers high performance while letting programmers focus on the overall computation rather than on technical details. Below is an example of an SPL parallel processing program:
|   | A | B | C |
|---|---|---|---|
| 1 | =file("data.txt") | | /Source file |
| 2 | fork 4 | =A1.cursor@t(amount;A2:4) | /Divide the file into 4 sections and create a cursor on each |
| 3 | | =B2.groups(;sum(amount):am) | /Traverse the cursor to sum amounts |
| 4 | =A2.conj().sum(am) | | /Concatenate the results of the threads and calculate the total |
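For comparison, the following Python sketch shows roughly what the same job involves when written by hand: split the file into byte ranges, sum one column per range, then add up the partial results. It is not how SPL implements `fork` or `cursor`. The names `sum_segment` and `parallel_sum` are hypothetical, and the sketch assumes a tab-separated file with a header row that contains an `amount` column.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def sum_segment(path, part, parts, column="amount", sep="\t"):
    """Sum `column` over the complete lines owned by one byte segment."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # The header row gives the column position; data starts after it.
        header = f.readline().rstrip(b"\r\n").decode("utf-8").split(sep)
        idx = header.index(column)
        data_start = f.tell()
        span = size - data_start
        start = data_start + span * part // parts
        end = data_start + span * (part + 1) // parts
        if part > 0:
            # Give up the partial first line; the previous segment owns it.
            f.seek(start - 1)
            f.readline()
        total = 0.0
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip(b"\r\n").decode("utf-8").split(sep)
            total += float(fields[idx])
        return total


def parallel_sum(path, parts=4):
    """Sum the column with one worker per segment, then combine the partials."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(lambda p: sum_segment(path, p, parts),
                                 range(parts)))
    return sum(partials)
```

A thread pool keeps the sketch short. In standard CPython builds, threads mostly overlap I/O rather than CPU-bound parsing, so a process pool would be the usual substitute when parsing dominates.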
Parsing a file often takes much longer than computing on the parsed data, so parallelizing the parsing takes priority. SPL provides a built-in option for retrieving data in parallel. Writing code for order-irrelevant operations, such as grouping and summing, thus becomes rather easy:
|   | A | B |
|---|---|---|
| 1 | =file("orders.txt").cursor@mt() | /The @m option chooses the number of threads automatically according to the system configuration |
| 2 | =A1.select(month(Date)==10) | /Filtering |
| 3 | =A2.groups(ID;sum(COST*WEIGHT):VALUE) | /Group and aggregate with serial processing |
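Grouping and summing parallelize easily because partial results can be merged in any order. The Python sketch below illustrates that idea only; it does not describe SPL's internals. Each worker parses and aggregates its own chunk of raw lines, and the per-chunk sums are then added together by key. The function names, the field layout `ID, Date, COST, WEIGHT`, the tab separator and the ISO date format are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def aggregate_chunk(lines, sep="\t"):
    """Parse raw lines, keep October orders, and sum COST*WEIGHT per ID."""
    partial = defaultdict(float)
    for line in lines:
        order_id, day, cost, weight = line.rstrip("\r\n").split(sep)
        if date.fromisoformat(day).month == 10:
            partial[order_id] += float(cost) * float(weight)
    return partial


def merge(partials):
    """Add per-chunk sums together; the order of the chunks does not matter."""
    merged = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


def parallel_group_sum(lines, workers=4):
    """Split the lines into chunks, aggregate each in a worker, then merge."""
    chunk = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(aggregate_chunk, chunks))
```

An order-dependent operation, such as a running total, could not be merged this simply, which is why the order-irrelevant case is the easy one.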
Real-world businesses have many large-file processing scenarios, and esProc SPL handles them conveniently. More examples can be found in Structured Text Computations with esProc.
esProc is a file processor that conveniently handles data loading, database export and mixed computations over various types of files, including TXT, Excel, XML, JSON, CSV and INI. The desktop tool is ready to use, simple to configure and convenient to debug: it supports breakpoints and step-by-step execution, during which you can view the result of each step. With a powerful yet simple syntax that matches the human way of thinking, esProc is more convenient to use than high-level languages. Read Data File Processor to learn details.
SPL integrates easily with Java programs. Read How to Call an SPL Script in Java to learn details.
To learn how to work with esProc, read Getting Started with esProc.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL