User Behavior Analysis in Practice 11: Order-based Grouping
Target task
We have a user events table T. Below is its structure and part of its data:
Time |
UserID |
EventType |
… |
2022/6/1 10:20 |
1072755 |
Search |
… |
2022/6/1 12:12 |
1078030 |
Browse |
… |
2022/6/1 12:36 |
1005093 |
Submit |
… |
2022/6/1 13:21 |
1048655 |
Login |
… |
2022/6/1 14:46 |
1037824 |
Logout |
… |
2022/6/1 15:19 |
1049626 |
AddtoCart |
… |
2022/6/1 16:00 |
1009296 |
Submit |
… |
2022/6/1 16:39 |
1070713 |
Browse |
… |
2022/6/1 17:40 |
1090884 |
Search |
… |
… |
… |
… |
… |
Fields in table T:
Field name |
Data type |
Description |
Time |
Datetime |
Time stamp of an event, accurate to milliseconds |
UserID |
String |
User ID |
EventType |
String |
Event type |
… |
… |
… |
Computing task:
Count users who perform the first N of a series of events (search, add to cart and submit order) within a time window so that we can calculate the customer conversion rate and churn rate. Such a computing scenario is known as conversion funnel analysis.
Note:
1. The three events should occur in the time order. Those that do not appear in order are not eligible.
2. The three events should happen under one user within a specified time window. Those that occur out of the time range are not included.
3. Begin timing at the occurrence of the first event. If the subsequent events occur in order within the time window, each event is recorded as 1; otherwise, it is recorded as 0. If the frequency of an event is 0, there is no need to scan the subsequent events.
Techniques involved
Find more about conversion funnel analysis in SQL Performance Enhancement: Conversion Funnel Analysis.
Read data of each account from the table ordered by account and time into the memory and then perform complicated computations.
The algorithm used to perform this computing task is complicated. It is inconvenient to achieve it on external storage, but needs to be implemented in the memory. Here the volume of data of one user is very small, and they can be all loaded into the memory.
When the original data is already ordered by account and time and a cursor is created for it, we can retrieve whole data of only one user each time. Now the data is naturally ordered by time and we can perform the conversion funnel analysis in the memory.
Sample code
According to the above-mentioned approach, we sort data in table T by UserID and Time and dump it in T.ctx.
Then loop data in the cursor, retrieve one user each time to perform funnel conversion analysis.
A |
B |
C |
D |
|
1 |
>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-04-05","yyyy-MM-dd"),tw=7 |
|||
2 |
[Search,AddtoCart,Submit] |
=A2.(0) |
||
3 |
=file("T.ctx").open().cursor(UserID, EventType,Time;Time>=start && Time<=end && A2.contain(EventType)) |
|||
4 |
for A3;UserID |
=first=A4.select@1(EventType==A2(1)) |
||
5 |
if(B4==null) |
next |
||
6 |
=t=null |
=A2.(null) |
||
7 |
for A2 |
if #B7==1 |
>C6(1)=t=t1=first.Time |
|
8 |
else |
>C6(#B7)=t=if(t,A4.select@1(EventType==B7 && Time>t && Time<elapse(t1,tw)).Time,null) |
||
9 |
=C6.(if(~,1,0)) |
|||
10 |
>D2=D2++B9 |
|||
11 |
return D2 |
A1 Define a time interval through two parameters and the time window tw. The values will be passed in as parameters during the computation.
A2 Names of target events. Note that their order is important.
D2 Generate an array having same length as A2’s to store the number of occurrences of each event. This is the final return result.
A3 Open the composite table file and generate its cursor, where filtering on time and event is performed.
A4 Loop the cursor and retrieve all data of one user each time.
B4 Get the record where the first event occurs for the first time, and assign it to variable first.
B5 If the first event does not occur, move on to the next user as the current user has become ineligible and there is no need to perform the computation on it.
B6 Define a variable t to store the time when the current event occurs during the later loop.
C6 Define an array of same length as A2 to store the time when each event occurs during the later loop.
B7 Loop each member of A2.
C7-D7 If the loop number in B7 is 1, the current event is the first one. The system records the time when the first event, which is now the value of t, occurs and assigns the time to t1.
C8-D8 If the current event isn’t the first one, check whether the previous value of t is null. If it is null, assign null to the current t; if it isn’t null, find the record of event that occurs the earliest from those where the time is greater than t and less than the time window t1+tw and assign the time to t.
B9 Loop C6, and record frequency for time that is null as 0 and that for time that is non-null as 1.
B10 Add the number of frequencies in B9 to D2.
We can use parallel processing to perform the computation faster:
A |
B |
C |
D |
E |
|
1 |
>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-04-05","yyyy-MM-dd"),tw=7 |
||||
2 |
[Search,AddtoCart,Submit] |
||||
3 |
=file("T.ctx").open().cursor@m(UserID, EventType,Time;Time>=start && Time<=end && A2.contain(EventType);2) |
||||
4 |
fork A3 |
=A2.(0) |
|||
5 |
for A4;UserID |
=first=B5.select@1(EventType==A2(1)) |
|||
6 |
if(first==null) |
next |
|||
7 |
=t=null |
=A2.(null) |
|||
8 |
for A2 |
if #C8==1 |
>D7(1)=t=t1=first.Time |
||
9 |
else |
>D7(#C8)=t=if(t,B5.select@1(EventType==C8 && Time>t && Time<elapse(t1,tw)).Time,null) |
|||
10 |
=D7.(if(~,1,0)) |
||||
11 |
>B4=B4++C10 |
||||
12 |
return B4 |
||||
13 |
return transpose(A4).(~.sum()) |
A3 The cursor function works with @m option to generate a multicursor.
A4 fork A3 generates multiple threads to perform parallel processing.
B4 Generate an array of same length as A2 to store result of the current thread.
B12 Return result of the current thread, and store it in A4 to form a sequence whose length is equivalent to the number of threads.
A13 Transpose A4’s result and perform sum to get the final result.
Execution result:
Member |
393400 |
257539 |
83375 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version