User Behavior Analysis in Practice 13: Bi-dimension Ordering
Target task
We have a user events table T. Below is its structure and part of its data:
Time |
UserID |
EventType |
OS |
Browser |
ProductID |
… |
f1 |
f2 |
f3 |
f4 |
f5 |
… |
2022/6/1 10:20 |
1072755 |
Search |
Android |
IE |
100001 |
… |
true |
false |
false |
true |
false |
… |
2022/6/1 12:12 |
1078030 |
Browse |
IOS |
Safari |
100002 |
… |
false |
false |
true |
true |
true |
… |
2022/6/1 12:36 |
1005093 |
Submit |
Android |
Chrome |
100003 |
… |
true |
true |
true |
false |
false |
… |
2022/6/1 13:21 |
1048655 |
Login |
Windows |
Chrome |
… |
false |
false |
true |
true |
true |
… |
|
2022/6/1 14:46 |
1037824 |
Logout |
Android |
Edge |
… |
false |
false |
false |
true |
true |
… |
|
2022/6/1 15:19 |
1049626 |
AddtoCart |
Windows |
Edge |
100004 |
… |
true |
true |
false |
true |
false |
… |
2022/6/1 16:00 |
1009296 |
Submit |
IOS |
Firefox |
100005 |
… |
false |
true |
false |
false |
true |
… |
2022/6/1 16:39 |
1070713 |
Browse |
IOS |
Sogou |
100006 |
… |
true |
true |
true |
false |
false |
… |
2022/6/1 17:40 |
1090884 |
Search |
Windows |
IE |
100007 |
… |
true |
false |
true |
true |
false |
… |
Fields in table T:
Field name |
Data type |
Description |
Time |
Datetime |
Time stamp of an event, accurate to milliseconds |
UserID |
String |
User ID |
EventType |
String |
Event type, whose value is Login, Browse, Search, AddtoCart, Submit or Logout |
OS |
String |
Operating system, whose value is Android, IOS, Windows or Unknown |
Browser |
String |
Browser, whose value is IE, Safari, Edge, Firefox, Chrome, Sogou or Unknown |
ProductID |
String |
Product ID, whose value is the ProductID field of dimension table Product |
… |
String |
Other fields that have enumerated values |
f1 |
Boolean |
Whether it is an offsite event or not; value is true or false |
f2 |
Boolean |
Whether it is a usual device or not; value is true or false |
f3 |
Boolean |
Whether it is a usual browser or not; value is true or false |
f4 |
Boolean |
Whether it is a cell phone or not; value is true or false |
f5 |
Boolean |
Whether it is the first operation; value is true or false |
… |
Boolean |
Other fields that have Boolean values |
Dimension table Product:
ProductID |
ProductName |
Unit |
Price |
ProductType |
100001 |
Apple |
Pound |
5.5 |
Fruits |
100002 |
Tissue |
Packs |
16 |
Home&Personalcare |
100003 |
Beef |
Pound |
35 |
Meat |
100004 |
Wine |
Bottles |
120 |
Beverage |
100005 |
Pork |
Pound |
25 |
Meat |
100006 |
Bread |
Packs |
10 |
Bakery |
100007 |
Juice |
Bottles |
6 |
Beverage |
… |
… |
… |
… |
… |
Fields in dimension table Product:
Field name |
Data type |
Description |
ProductID |
String |
Product ID |
ProductName |
String |
Product name |
Unit |
String |
Sales unit |
Price |
Numeric |
Unit price |
ProductType |
Integer |
Product type |
Computing task:
Count users who are not newcomers on a local Android or IOS system using Safari, Edge or Chrome and who perform the first N of a series of events (search, add to cart and submit order) under the product type Home & Personal care within a time window during the specified time period so that we can calculate the customer conversion rate and churn rate. Such a computing scenario is known as conversion funnel analysis.
Make a note that the total volume of data is large and the time span is long, but that each time window is relatively short. The other aspects are same as the circumstances in the conversion funnel analysis in series 11:
1. The three events should occur in the time order. Those that do not appear in order are not eligible.
2. The three events should happen under one user within a specified time window. Those that occur out of the time range are not included.
3. Begin timing at the occurrence of the first event. If the subsequent events occur in order within the time window, each event is recorded as 1; otherwise, it is recorded as 0. If the frequency of an event is 0, there is no need to scan the subsequent events.
Techniques involved
Learn more about bi-dimension ordering in articles SPL Pseudo Table Bi-dimension Ordering Structure and Speed up Big Data User Behavior Analysis using Bi-dimension Ordering Structure.
The use of bi-dimension ordering structure enables ordering data as a whole by Time (which helps achieve fast filtering) and the result set by UserID (for convenience of subsequent computations). It appears that data is ordered by two dimensions at the same time through this structure.
Sample code
1. Define a multizone composite table and split table by month@y()(suppose data is within 2010-2021).
A |
|
1 |
=to(2021,2022).conj((a=~*100,12.(~+a))) |
2 |
=file("T.ctx":A1).create@y(#UserID,#Time,Month,EventType,OS,Browser,ProductID,b1;month@y(Time)) |
3 |
=A2.close() |
A1 Define a sequence of yearmonths whose members will be zone table numbers, that is, one zone table for each month.
A2 Define a multizone composite table, use A1 as its zone table numbers and define zone table number’s computing expression as month@y(Time).
2. Define a pseudo table based on the multizone composite table
A |
|
… |
/ The above multizone composite table definition code |
4 |
=T("Product.btx").keys@i(ProductID) |
5 |
=[{file:"T.ctx", zone:A1, user:"UserID", date:"Time", column:[ {name:"Month",exp:"month@y(Time)"}, {name:"EventType",pseudo:"EventTypeName",enum:["Login","Browse","Search","AddtoCart","Submit","Logout"]}, {name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]}, {name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]}, {name:"b1",bits:["f1","f2","f3","f4","f5"]}, {name:"ProductID",dim:A4}] }] |
6 |
=pseudo(A5) |
A4 Import dimension table Product and create primary key and index for it.
A5 Define a pseudo table based on multizone composite table T.ctx. zone is the list of zone table numbers; date is the zone field, which is Time here; and user is the account field, which is UserID here. You can learn more about the roles of zone field and account filed in above hyperlinked essays.
Read the last essay in this series to learn about explanations of other definitions.
A6 Generate pseudo table.
3. Retrieve data from the table T, sort data by UserID and Time and append it to the multizone composite table through the pseudo table
A |
|
… |
/ The above pseudo table definition code |
7 |
=connect("demo").cursor@x("select * from T").sortx(UserID,Time) |
8 |
=A6.append@i(A7) |
A7 Connect to the database, retrieve data from table T to generate a cursor, and sort the cursor data by UserID and Time.
A8 Fetch data from the cursor, and append it to the multizone composite table through the pseudo table.
In actual practice, new data is added continuously. Suppose Time field contains the time when an event occurs, the newly-added data always has a larger Time value. The new data is either appended to the last zone table or written to a new zone table, but they won’t be inserted to any other zone table. SPL can automatically ensure that data in the last zone table is always ordered by UserID, so we can still append the newly increased data using the above code.
4. Summarize data using the pseudo table
A |
B |
C |
D |
|
… |
/ The above pseudo table definition code |
|||
7 |
>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-04-05","yyyy-MM-dd"),tw=7 |
|||
8 |
[Search,AddtoCart,Submit] |
=A8.(0) |
||
9 |
=A6.cursor(UserID, EventType,Time;Time>=start && Time<=end && A8.contain(EventType) && ProductID.ProductType=="Home&Personalcare"&& ["Safari","Edge","Chrome"].pos(BrowserName) && ["Android","IOS"].pos(OSName) && ! f1 && f4 && !f5) |
|||
10 |
for A9;UserID |
=first=A10.select@1(EventType==A8(1)) |
||
11 |
if(B10==null) |
next |
||
12 |
=t=null |
=A8.(null) |
||
13 |
for A8 |
if #B13==1 |
>C12(1)=t=t1=first.Time |
|
14 |
else |
>C12(#B13)=t=if(t,A10.select@1(EventType==B13 && Time>t && Time<elapse(t1,tw)).Time,null) |
||
15 |
=C12.(if(~,1,0)) |
|||
16 |
>D8=D8++B15 |
|||
17 |
return D8 |
A7 Define a time interval through two parameters and the time window tw. Thei values will be passed as parameters in during the computation.
A8 Names of target events. Note that their order is important.
D8 Generate an array having same length as A8’s to store the number of occurrences of each event. This is the final return result.
A9 Generate cursor for the pseudo table and perform filtering on the cursor data according to the specified time interval, target events, product type, browsers and devices. A fast filtering will be automatically executed on the pseudo table to select desired data from the specified time interval.
A10 Loop data in the cursor and retrieve all data of one user each time. The pseudo table ensures that data is ordered by UserID. Here we can use same approach as that we used to handle an ordered cursor in the above.
The rest of the computation is the same as that on an ordered cursor.
B10 Get the record where the first event occurs for the first time, and assign it to variable first.
B11 If the first event does not occur, move on to the next user as the current user has become ineligible and there is no need to perform the computation on it.
B12 Define a variable t to store the time when the current event occurs during the later loop.
C12 Define an array of same length as A8 to store the time when each event occurs during the later loop.
B13 Loop each member of A8.
C13-D13 If the loop number in B13 is 1, the current event is the first one. The system records the time when the first event, which is now the value of t, occurs and assigns the time to t1.
C14-D14 If the current event isn’t the first one, check whether the previous value of t is null. If it is null, assign null to the current t; if it isn’t null, find the record of event that occurs the earliest from those where the time is greater than t and less than the time window t1+tw and assign the time to t.
B15 Loop C12, and record frequency for time that is null as 0 and that for time that is non-null as 1.
B16 Add the number of frequencies in B15 to D8.
Execution result:
Member |
393400 |
257539 |
83375 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version