Filtering Based on Two Files
【Question】
I am trying to make a comparison between two large tab-delimited files. I have been trying to use Awk & Bash (Ubuntu 15.10), Python (v3.5) and PowerShell (Windows 10). My only background is Java but my field tends to stick with the scripting languages.
File 1 A[ ]
1 gramene gene 4854 9652 - ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein \[Source:UniProtKB/TrEMBL%3BAcc:C0P8I2\];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1 gramene gene 9882 10387 - ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1 gramene gene 109519 111769 - ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1 gramene gene 136307 138929 + ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1
File 2 B [ ]
S1_6370 T/C 1 6370 +
S1_8210 T 1 8210 +
S1_8376 A 1 8376 +
S1_9889 A 1 9889 +
Output:
1 ID=gene:GRMZM2G059865 4857 9652 - S1_6370 T/C 6370 +
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8210 T 8210 +
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8376 A 8376 +
1 ID=gene:GRMZM5G888250 9882 10387 -S1_9889 A 9889 +
My general logic:
loop (until end of A\[ \] and B\[ \])``if``B\[$4\]>A\[$4\] && B\[$4\]<A\[$5\] #if the value in B column 4 is in between the values in A columns 4 & 5.``then``-F”\\t” print {A\[1\], A\[9(filtered)\], A\[$4FS$5\], B\[$1\], B\[$2\], B\[$3\], B\[$4\], B\[$5\]} #hopefully reflects Awk column calls if the two files were able to have their columns defined that way.``movea++ \# to see if the next set of B column 4 values is in between the values in A columns 4 & 5 ``else``moveb++ #to see if the next set of A columns 4&5 values contain the current vales of B column 4 in them.
I know this logic doesn’t follow any language that I am aware of but is similar in parts. It seems like NR and FNR are two built-in running values in Awk. Awk helped me split up File 2 that has 10 values in B[$1] into 10 files quite easily and also cut helped with cutting out the few hundred columns (~255+) beyond the 5 you see here. Now I am working File 2 sizes around a couple of MB instead of 1 file of 1.6 GB. Other than cutting down loading times, I wanted to simplify the loops. I haven’t backtracked to my previous attempts of Python or PowerShell since I trimmed the file sizes down. I convinced myself they just weren’t going to read my files with their built-in libraries or cmdlets, which I’ll try sometime soon if I am unable to figure out an Awk solution.
Comparing multiple files and columns using Awk referenced Awk greater than less than but within a set range referenced efficiently splitting one file into several files by value of column. The one thing that worked using Awk to get a specific string in line might be able to filter column 9 How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix? #this seemed the closest but without all the printing out in a third file I wanted, still not able to figure out the syntax completely.
Below is a solution:
awk '
BEGIN{
x=getline s <"B"
split(s,b,"\\t")
}
!x {
exit
}
{
sub(/;.*/,"",$9)
while (x && $4<b\[4\] && b\[4\]<$5) {
print $1,$9,$4,$5,$7,b\[1\],b\[2\],b\[4\],b\[5\]
x=getline s <"B"; split(s,b,"\\t")
}
}
'OFS='\\t' A
【Answer】
You can speed up the retrieval of smaller File2 by importing it into the memory. And you can batch retrieve the bigger File1 with the cursor. Import each of them line by line and then use for loop to query desired values by column number and append them to a file.
It’s rather complicated to code it in Java. The Awk code is simple but slow and isn’t suitable to handle a large file. An easy solution is SPL. The Structured Process Language enables simple and efficient code and supports execution from command line and integration with Java. Here’s the SPL script:
A |
B |
|
1 |
=file("file2.txt").import() |
|
2 |
=file("file1.txt").cursor() |
|
3 |
for A2 |
=A1.select(#4>A3.#4 && #4> A3.#5) |
4 |
=B3.new(A3.#1,A3.#7.split(";")(1),A3.#4,A3.#5,"-",#1,#2,#4,#5) |
|
5 |
=file("D:/result.txt").export@a(B4) |
A3: Loop over records in A2’s cursor.
B3: Get eligible records from file1 and file2.
B4: Get field values by column number to generate a new table sequence.
B5: Export the query result to a file.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL