"Target task: We have a user events table T. Below is its structure and part of its data: Time Us .."

Hoo RaqForum 19 No.
1 Reply • 374 View • 2 Years ago

User Behavior Analysis in Practice 2: Redundant Grouping Key Field

Target task:

We have a user events table T. Below is its structure and part of its data:

Time	UserID	EventTypeID	EventType
2022/6/1 10:20	1072755	3	Search
2022/6/1 12:12	1078030	2	Browse
2022/6/1 12:36	1005093	5	Submit
2022/6/1 13:21	1048655	1	Login
2022/6/1 14:46	1037824	6	Logout
2022/6/1 15:19	1049626	4	AddtoCart
2022/6/1 16:00	1009296	5	Submit
2022/6/1 16:39	1070713	2	Browse
2022/6/1 17:40	1090884	3	Search

Fields in table T:

Field name	Data type	Description
Time	Datetime	Time stamp of an event, accurate to milliseconds
UserID	Integer	User ID
EventTypeID	Integer	Event type ID
EventType	String	Event type name

Computing task:

Find the number of events under each type and that of distinct users who perform that type of event in the specified time period, and display the event type name in the result.

Both the code fields and value fields, which can be uniquely determined by corresponding code fields have been generated at the creation of the wide table so that less JOIN operations will be needed. In our case, each event type name is determined by one event type ID.

Techniques involved:

Redundant grouping key field: With a composite grouping key where a certain key field or certain key fields can determine values of another field or other fields, we can use them as the grouping field(s) and retrieve the first matching value in each controlled field, which will not engage in the grouping operation any more. This leads to better performance.

In our case, values of EventType are completely determined by EventTypeID field. So, we can group table only by EventTypeID and get the first corresponding value of EventType field.

Sample code

Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:

	A
1	=file("T.btx").cursor@mb()
2	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3	=A1.select(Time<=end && Time>=start).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum)

In groups function, the grouping field EventTypeID comes before the semicolon. Since EventType is commanded by EventTypeID, it is written in the aggregation part without being preceded by an aggregate function – meaning retrieving the first member of the current group directly. Excluding such a field from comparisons of the grouping operation makes computation faster.

The grouping field must be one that determines the other or other fields and the controlled field(s) should be written after the semicolon. They need to be written at right positions.

Execution result:

EventTypeID	EventType	Num	iNum
1	Login	4033000	393400
2	Browse	3578901	348791
3	Search	2947931	257539
4	AddtoCart	1845674	175476
5	Submit	867345	83375
6	Logout	4033000	393400

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Hoo • 374 View • 2 Years ago

User Behavior Analysis in Practice 2: Redundant Grouping Key Field

Target task:

Techniques involved:

Sample code

ToC