Combine the Line at a Given Line Number from a Group of Files
【Question】
I have a couple of directories that contain tens of thousands of html, txt, csv and other text-based files. I want to output one particular line number from each file into a result file, and I need to do this as efficiently, quickly and easily as possible on a Windows 7 machine. My problem is HOW?
I am currently using the "find in files" feature of TextPad and Notepad++ to search for a particular string that appears in every file on the same line number. I think there should be a tool that gives me the same result more efficiently by simply going straight to that line number in each of the files (200k) in the subdirectories.
I am trying to extract line #123 from each of the 200k files and put those lines into a new text file.
I don't need to replace or edit...
There are multiple folders with 10k to 200k files in each. Opening those folders and subfolders is one of the things I am trying to avoid: even with 16 GB of DDR3 on a dual quad-core machine, it is too slow, error-prone and resource-intensive.
【Answer】
The algorithm is straightforward: walk the directory tree; in each directory, read the target line from every file and write the results to the output file once that directory has been read through; then repeat for each subdirectory. This is hard to do from the command line. High-level languages can do it, but the code is harder to write, especially when the files are large. It is easy in SPL (Structured Process Language), which can read a large file through a cursor and lets a script call itself recursively. Here is the SPL script:
| | A | B |
|---|---|---|
| 1 | =directory@p(path) | |
| 2 | =A1.(file(~).cursor@s()) | |
| 3 | =A2.((~.skip(122),~.fetch@x(1))) | |
| 4 | =A3.union() | |
| 5 | =file("d:\\result.txt").export@a(A4) | |
| 6 | =directory@dp(path) | |
| 7 | if A6.len()==0 | |
| 8 | | return |
| 9 | else | |
| 10 | | =A6.(call("c:\\readfile.dfx",~)) |
A1: Get the files in the current directory, which is passed in through the parameter path; the initial value of the parameter is the root directory.
A2: Open each file in A1 with a cursor to reduce memory usage. A1.(…) evaluates the expression on A1's members one by one; ~ represents the current member; the file() function creates a file object.
A3: For each file cursor in A2, skip the first 122 lines and read line 123. A2.(…) evaluates the expression on each cursor in A2. (~.skip(122),~.fetch@x(1)) evaluates the expressions inside the outer parentheses in order and returns the result of the last one: ~.skip(122) skips the first 122 lines, and ~.fetch@x(1) reads one line at the current position (i.e. line 123). The @x option closes the cursor automatically after the data is fetched.
A4: Union the lines fetched from the cursors in A3.
A5: Write A4 to the target file result.txt; the @a option appends to the file, so results from earlier directories are kept.
A1-A5 extract the line from the files in the current directory. Now we just need to get the subdirectories and call the script recursively.
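For comparison, the per-directory part (A1 to A5) can be sketched in Python. This is not the SPL script above but a stand-in under stated assumptions: the helper names read_line and extract_from_directory are hypothetical, files are assumed to be readable as UTF-8 (undecodable bytes are replaced), and files shorter than the requested line are skipped silently.

```python
from itertools import islice
from pathlib import Path


def read_line(path, line_number):
    """Return line `line_number` (1-based) of a text file, or None if the file is shorter."""
    # Assumption: UTF-8 with errors="replace", so an odd encoding does not abort the run.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # islice reads lazily and stops after the target line, so the rest
        # of the file is never loaded (similar in spirit to skip + fetch).
        line = next(islice(handle, line_number - 1, line_number), None)
    return None if line is None else line.rstrip("\r\n")


def extract_from_directory(directory, line_number, result_path):
    """Append the chosen line of every file directly inside `directory` to `result_path`.

    Returns the number of lines written. Subdirectories are not visited here.
    """
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    written = 0
    # Mode "a" appends, so repeated calls accumulate results in one file.
    with open(result_path, "a", encoding="utf-8") as out:
        for path in files:
            line = read_line(path, line_number)
            if line is not None:
                out.write(line + "\n")
                written += 1
    return written
```

A call such as extract_from_directory(some_directory, 123, some_result_file) would then cover one directory level. The result file should live outside the directory being scanned, otherwise it would be picked up as an input file on a later run.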
A6: Get the list of subdirectories under the current directory. In the directory() function, the @d option returns subdirectories instead of files, and @p returns full paths.
A7-B8: Return if no subdirectory is found.
A9-B10: Otherwise, call the script recursively for each subdirectory in A6. call("c:\\readfile.dfx",~) invokes the SPL script c:\\readfile.dfx (the name under which this script is saved) and passes the current subdirectory as the input parameter path.
Through this recursive call, the SPL script processes every level of the directory tree under the parameter path.
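The recursive structure (finish the current directory, then repeat for each subdirectory) can also be sketched in Python. As before, this is an illustration rather than the SPL script itself: collect_lines is a hypothetical name, UTF-8 with replacement of undecodable bytes is an assumption, and symbolic links to directories are deliberately not followed to avoid cycles.

```python
import os
from itertools import islice


def collect_lines(directory, line_number, out):
    """Write line `line_number` of every file under `directory` to the open text file `out`."""
    subdirectories = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                # Remember the subdirectory; it is handled after this level is done.
                subdirectories.append(entry.path)
            elif entry.is_file():
                # Assumption: UTF-8 input; undecodable bytes are replaced.
                with open(entry.path, "r", encoding="utf-8", errors="replace") as handle:
                    line = next(islice(handle, line_number - 1, line_number), None)
                if line is not None:  # files that are too short are skipped
                    out.write(line.rstrip("\r\n") + "\n")
    # Recurse only once the current directory is finished; the recursion
    # ends naturally in directories that have no subdirectories.
    for subdirectory in subdirectories:
        collect_lines(subdirectory, line_number, out)
```

To use it, open the result file once in append or write mode and pass the file object together with the root directory and the line number (123 in the question). Opening the output a single time avoids reopening it for each of the many directories.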