Group & Aggregate & Union
【Question】
I am trying to select top values in a column based on the variable/field in another column. It is a very large tab delimited file.
Input:
Names col2 col3 col4
A A1 def 10
A A1 BBB 10
A A1 CED 10
A A1 fff 7.5
B B1 def 7.5
B B1 dff 5.5
B B1 fff 4.5
C C1 ggg 6.5
C C3 iii 6.5
C C2 hhh 5.0
C C4 toi 6.5
D D1 xyz 10.0
D D2 ikj 7.5
D D3 abc 7.5
...
Output:
Names col2 col3 col4
A A1 def 10
A A1 BBB 10
A A1 CED 10
B B1 def 7.5
C C1 ggg 6.5
C C3 iii 6.5
C C4 toi 6.5
D D1 xyz 10.0
Basically, I want all the rows with values 10 and the top values for each of the names in column1. Any inputs to solve this by Perl, Awk or sed are well appreciated. Thanks.
【Answer】
Your question is one of the typical structured computations. Shell takes a roundabout route to do this. A faster and more direct way is using SPL (Structured Process Language):
A |
|
1 |
=file("file.txt").import@t() |
2 |
=A1.group(Names) |
3 |
=A2.((a=~.max(col4),~.select(col4==a|| col4==10))) |
4 |
=A3.union() |
A1: Read in content of file.txt.
A2: Group records by Names.
A3: A2.(…) runs a loop to process each group of data in A2. (a=~.max(col4),~.select(col4==a|| col4==10)) calculates expression in order and returns result of the last expression. a=~.max(col4) returns the top values in col4 in each group to variable a. ~.select(col4==a|| col4==10) selects records where col4 value is a or 10, which are the final result.
A4: Union groups in A3.
You can also execute the SPL script at the command line under Unix/Linux.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL