Key Associations between Multiple Files
【Question】
I want to match column 1 of file 1 with column 1 of file 2 and then column 2 of file 1 with column 1 of file 3 and then print the matches. The columns in the files are separated by tabs. For example:
file 1:
fji01dde AIDJFMGKG
dlp02sle VMCFIJGM
cmr03lsp CKEIFJ
file 2:
fji01dde 25 30
dlp02sle 40 50
cmr03lsp 60 70
file 3:
AIDJFMGKG
CKEIFJ
Output:
fji01dde AIDJFMGKG 25 30
cmr03lsp CKEIFJ 60 70
I only want lines that are common in all three files.
The below code works well for the first two files, but I need to incorporate the third file. Any ideas?
#!/usr/bin/env perl
use strict;
my (%file1,%file2);
## Open the 1st file
open(A,"file1");
while(<A>){
chomp;
## Split the current line on tabs into the @F array.
my @F=split(/\t/);
push @{$file1{$F[0]}},@F[1..$#F];
}
## Open the 2nd file
open(B,"file2");
while(<B>){
chomp;
## Split the current line on tabs into the @F array.
my @F=split(/\t/);
if (defined($file1{$F[0]})) {
foreach my $col (@{$file1{$F[0]}}) {
print "$F[0]\t$col\t@F[1..$#F]\n";
}
}
}
What would the output be if file3 did not have CKEIFJ
It would be just fji01dde AIDJFMGKG 25 30
A solution:
use strict;
use warnings;
use Text::CSV_XS qw(csv);
my @csv_files = @ARGV;
# Parse all the CSV files into arrays of arrays.
my $data1 = csv(in => $csv_files[0], sep_char => "\t" );
# Parse the other CSV files into hashes of rows keyed on the columns we're going to search on.
my $data2 = csv(in => $csv_files[1],
sep_char => "\t",
headers => ["code", "num1", "num2"\],
key => "code"
);
my $data3 = csv(in => $csv_files[2],
sep_char => "\t",
headers => \["CODE"],
key => "CODE"
);
for my $row1 (@$data1) {
my $row2 = $data2->{$row1->[0]};
my $row3 = $data3->{$row1->[1]};
if($row2 && $row3) {
print join "\t", $row1->[0], $row1->[1], $row2->{num1}, $row2->{num2};
print "\n";
}
}
【Answer】
This is a typical structured computation. The relationship between file1’s column1 and file2’s column1is a left join; and that between file1’s column2 and file3’s column1 is also a left join. A language that supports structured algorithm can describe the relationship clearly and perform the joins easily. Yet Perl lacks related functions. It is able to achieve this but the code is hard to understand.
Here we handle your question with SPL (Structured Process Language). The code is simple and easy, as shown below:
A |
|
1 |
=file("D:/file1.txt").import() |
2 |
=file("D:/file2.txt").import() |
3 |
=file("D:/file3.txt").import() |
4 |
=A1.switch(#1,A2:#1) |
5 |
=A3.switch(#1,A1:#2) |
6 |
=A5.new(#1.#1.#1, #1.#2, #1.#1.#2, #1.#1.#3) |
A4-A5: Create table associations.
A6: Retrieve data by the associated fields. #1.#1.#1 represents file3’s column1. It points to file1’s column1 and then file2’s column1.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL