6.16 Order-based grouping: by continuous same value – big data
With a huge amount of data, start a new group each time the grouping field value differs from that of the preceding record, and then summarize the data in each group.
We have a large log file in which entries are written in datetime order. The task is to find the date with the longest run of consecutive ERROR-level entries.
Date | Time | Level | IP | … |
---|---|---|---|---|
2020/1/1 | 0:00:01 | INFO | 166.253.153.234 | … |
2020/1/1 | 0:00:02 | INFO | 99.72.133.239 | … |
2020/1/1 | 0:00:04 | WARN | 99.11.105.39 | … |
2020/1/1 | 0:00:05 | INFO | 117.69.80.195 | … |
2020/1/1 | 0:00:11 | INFO | 79.195.137.228 | … |
… | … | … | … | … |
SPL provides the cs.group() function for grouping a huge number of records through a cursor. It creates a new group whenever the grouping field value changes from one record to the next.
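To make the neighbor-comparison idea concrete, here is a minimal sketch in Python (not SPL, and not a description of how cs.group() is implemented internally). The helper name `count_runs` and the sample records are hypothetical. The generator consumes records one at a time and emits a (key, count) pair each time the key changes, so only the current run is tracked in memory.

```python
from typing import Callable, Hashable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def count_runs(
    records: Iterable[T], key: Callable[[T], Hashable]
) -> Iterator[Tuple[Hashable, int]]:
    """Yield (key, count) for each run of adjacent records sharing the same key."""
    sentinel = object()          # marks "no record seen yet"
    current = sentinel
    count = 0
    for rec in records:
        k = key(rec)
        if current is not sentinel and k != current:
            # The key changed between neighbors: close the previous group.
            yield current, count
            count = 0
        current = k
        count += 1
    if current is not sentinel:
        # Emit the final run once the input is exhausted.
        yield current, count


# Made-up (Date, Level) records, already in datetime order.
sample = [
    ("2020/1/1", "INFO"),
    ("2020/1/1", "INFO"),
    ("2020/1/1", "ERROR"),
    ("2020/1/1", "INFO"),
    ("2020/1/2", "INFO"),
]
runs = list(count_runs(sample, key=lambda r: r))
```

Note that the two INFO runs on 2020/1/1 stay separate because an ERROR record sits between them; this is the difference between order-based grouping and ordinary equi-grouping.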
SPL script:
 | A |
---|---|
1 | =file("ServerLog.txt").cursor@t() |
2 | =A1.group(Date,Level;count(~):ErrorCount) |
3 | =A2.select(Level:"ERROR") |
4 | =A3.top(-1;ErrorCount) |
A1 Create a cursor for the log file.
A2 Use the cs.group() function to group the records, starting a new group whenever the date or the log level differs from that of the previous record, and count the records in each group.
A3 Get the groups whose log level is ERROR.
A4 Get the group containing the largest number of consecutive ERROR records.
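The four steps can be mirrored in Python as a rough sketch. It assumes a tab-separated file whose first line holds the column names, including Date and Level; that layout is an assumption made for illustration. `itertools.groupby` also starts a new group whenever the key changes between neighboring rows, so the input is streamed rather than loaded whole. The function name `longest_error_run` and the sample rows are hypothetical.

```python
import csv
import io
from itertools import groupby
from typing import Iterable, Optional, Tuple


def longest_error_run(lines: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (date, count) for the longest run of adjacent ERROR rows, or None."""
    # Step A1: read rows lazily, taking field names from the header line.
    reader = csv.DictReader(lines, delimiter="\t")
    best: Optional[Tuple[str, int]] = None
    # Step A2: group neighboring rows that share the same (Date, Level).
    for (date, level), rows in groupby(reader, key=lambda r: (r["Date"], r["Level"])):
        count = sum(1 for _ in rows)
        # Step A3: keep only the ERROR groups.
        if level != "ERROR":
            continue
        # Step A4: remember the largest group (the first one wins on ties).
        if best is None or count > best[1]:
            best = (date, count)
    return best


# Made-up rows standing in for the log file.
sample = io.StringIO(
    "Date\tTime\tLevel\n"
    "2020/1/1\t0:00:01\tINFO\n"
    "2020/1/1\t0:00:02\tERROR\n"
    "2020/1/1\t0:00:04\tINFO\n"
    "2020/1/2\t0:00:01\tERROR\n"
    "2020/1/2\t0:00:03\tERROR\n"
    "2020/1/2\t0:00:05\tINFO\n"
)
result = longest_error_run(sample)
```

For a file on disk, the same function could be fed an open file handle, for example `with open("ServerLog.txt", newline="") as f: longest_error_run(f)`, since a file object yields its lines one at a time.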
Execution result:
Date | ErrorCount |
---|---|
2020/01/02 | 4 |