Get Top 3 from Each Hive Group
Apache Hive uses window function and subqueries to handle top N problems. It’s difficult to write the program:
SELECT *
FROM (
SELECT g, ROW_NUMBER() OVER(PARTITION BY g
ORDER BY x) group_id RN FROM A
)
WHERE group_id <= 3
ORDER BY g ASC, group_id ASC
It’s simple and easy to do this in SPL (Structured Process Language). Only a one-liner is sufficient:
A.group(g;~.top(-3;x)).conj(#2)
* To get rows where a specified field value is greater than the median from each group;
* To get the maximum growth rate in each group;
* To get groups where values of a specified field rises consecutively;
SPL is good at handling these types of calculations over subsets or detailed data in a group. More explanations and examples can be found in Getting Top N Problem Optimization in SPL, Intra-group Calculation Optimization in SPL and Inter-row Calculation Optimization in SPL.
esProc is a desktop programming tool. It’s easy to install, and it’s convenient to debug code and view data in it:
esProc is powerfully independent yet integration-friendly. You can easily embed an SPL script into any Java program. Refer to How to Call an SPL Script in Java to learn details.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL