6.16 Order-based grouping: by continuous same value – big data

 

With a very large data set, start a new group each time the value of the grouping field differs from that of the previous record, then aggregate the data in each group.
Suppose we have a large log file whose entries are written in chronological order. The task is to find the date on which the ERROR log level appears the most times in a row.

Date      Time     Level  IP
2020/1/1  0:00:01  INFO   166.253.153.234
2020/1/1  0:00:02  INFO   99.72.133.239
2020/1/1  0:00:04  WARM   99.11.105.39
2020/1/1  0:00:05  INFO   117.69.80.195
2020/1/1  0:00:11  INFO   79.195.137.228

SPL provides the cs.group() function for grouping a large number of records read through a cursor. It starts a new group whenever the grouping field value differs from that of the preceding record.
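
To illustrate the adjacency rule outside SPL, the following Python sketch uses itertools.groupby, which likewise starts a new group whenever the key changes between neighboring items and reads its input lazily. It is only an analogy for the grouping behavior described above, not how SPL implements it, and the sample levels are invented for the illustration.

```python
from itertools import groupby

# Hypothetical sequence of log levels in arrival order.
levels = ["INFO", "ERROR", "ERROR", "INFO", "ERROR"]

# groupby compares each item only with its predecessor, so the two
# separate ERROR runs stay separate instead of being merged into one group.
runs = [(level, sum(1 for _ in items)) for level, items in groupby(levels)]

print(runs)
```

An ordinary equi-grouping would report a single ERROR group of size 3 here. The order-based grouping reports one run of 2 and one run of 1, which is what the task needs.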

SPL script:

A
1 =file("ServerLog.txt").cursor@t()
2 =A1.group(Date,Level;count(~):ErrorCount)
3 =A2.select(Level=="ERROR")
4 =A3.top(-1;ErrorCount)

A1 Create a cursor over the log file; the @t option reads the first line as column names.
A2 Use the cs.group() function to group the records, starting a new group whenever the date or log level differs from that of the preceding record, and count the records in each group.
A3 Keep only the groups whose log level is ERROR.
A4 Get the group with the largest count, that is, the longest run of consecutive ERROR records.
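
For readers more familiar with general-purpose languages, the four steps can be sketched in Python as a single streaming pass. This is a rough equivalent under stated assumptions, not a translation of SPL internals: the function names longest_error_run and read_log are hypothetical, and the file is assumed to be tab-separated with a header row containing Date and Level.

```python
import csv
from itertools import groupby
from typing import Iterable, Iterator, Optional, Tuple


def longest_error_run(rows: Iterable[dict]) -> Optional[Tuple[str, int]]:
    """Return (date, count) for the longest run of consecutive ERROR rows."""
    best = None
    # Like A2: a new group starts whenever (Date, Level) differs from the previous row.
    for (date, level), items in groupby(rows, key=lambda r: (r["Date"], r["Level"])):
        count = sum(1 for _ in items)  # consume the group before moving on
        # Like A3: ignore groups that are not ERROR.
        if level != "ERROR":
            continue
        # Like A4: remember the largest group seen so far (first one wins ties).
        if best is None or count > best[1]:
            best = (date, count)
    return best


def read_log(path: str) -> Iterator[dict]:
    """Stream rows one at a time (assumed: tab-separated, header in first line)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f, delimiter="\t")
```

Calling longest_error_run(read_log("ServerLog.txt")) keeps only one row and one running maximum in memory at a time, which is the same reason the SPL script works on a cursor rather than loading the whole file.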

Execution result:

Date ErrorCount
2020/01/02 4