Get Rows Whose Column Contains Values Listed in Another Data Set
【Question】
I need to extract all hits from one list (list.txt) which can be found in one of the columns of another (here in Data.txt) into a third (output.txt).
Data.txt (tab-delimited):
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt:
Gee
Whiz
Lol
Ideally output.txt looks like:
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script:
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately it gave me an output.txt including the last line, I assume as ‘Whizz’ contains ‘Whiz’.
I also tried cat Data.txt | egrep -F “List.txt” and that resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual List.txt contains a sorted list of 985 words, and Data.txt has 115576 rows with 17 columns.
【Answer】
One way to do this is to split the here field of each row in Data.txt into a set and keep the row if that set intersects the words in List.txt. Comparing whole items rather than substrings also stops ‘Whizz’ from matching ‘Whiz’. Set operations are awkward to write in shell, so here is a solution in SPL (Structured Process Language):
| | A |
|---|---|
| 1 | =file("/Data.txt").import@t() |
| 2 | =file("/List.txt").read@n() |
| 3 | =A1.select(here.array(";")^A2!=[]) |
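A1 imports Data.txt as a table, with @t taking the first line as column names. A2 reads List.txt as a sequence of lines. In A3, the “^” sign computes the intersection of two sets and “[]” is the empty set, so a row is selected when its here field, split on “;”, shares at least one item with the list in A2.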
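For readers without SPL, the same set-intersection idea can be sketched in Python. This is only an illustrative sketch, not a translation of the SPL script. The function name `select_rows` and its parameters are made up for this example. The sample rows are copied from the question, but which column is blank in the last two rows is an assumption.

```python
import csv

def select_rows(data_lines, wanted_words, column="here", sep=";"):
    """Yield rows (as dicts) whose `column` shares at least one item with wanted_words."""
    wanted = set(wanted_words)
    reader = csv.DictReader(data_lines, delimiter="\t")
    for row in reader:
        # Split the field into exact items, so "Whizz" never matches "Whiz".
        if wanted & set(row[column].split(sep)):
            yield row

# Sample data from the question (the position of the blank column is assumed).
data = [
    "some_data\tmore_data\tother_data\there\tyet_more_data\tetc",
    "A\tB\t2\tGee;Whiz;Hello\t13\t12",
    "A\tB\t2\tGee;Whizz;Hi\t56\t32",
    "E\t\t4\tBtm;Lol\t16\t2",
    "T\t\t3\tWhizz\t13\t3",
]
words = ["Gee", "Whiz", "Lol"]

hits = [row["here"] for row in select_rows(data, words)]
print(hits)
```

With the real files, an open file object for Data.txt can be passed as `data_lines`, and the stripped lines of List.txt as `wanted_words`. Looking words up in a set keeps the work per row small, which should be comfortable for 985 words and 115576 rows.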