User Behavior Analysis in Practice 12: Using Pseudo Tables
Target task
We have a user events table T. Below is its structure and part of its data:
Time |
UserID |
EventType |
OS |
Browser |
ProductID |
… |
f1 |
f2 |
f3 |
f4 |
f5 |
… |
2022/6/1 10:20 |
1072755 |
Search |
Android |
IE |
100001 |
… |
true |
false |
false |
true |
false |
… |
2022/6/1 12:12 |
1078030 |
Browse |
IOS |
Safari |
100002 |
… |
false |
false |
true |
true |
true |
… |
2022/6/1 12:36 |
1005093 |
Submit |
Android |
Chrome |
100003 |
… |
true |
true |
true |
false |
false |
… |
2022/6/1 13:21 |
1048655 |
Login |
Windows |
Chrome |
… |
false |
false |
true |
true |
true |
… |
|
2022/6/1 14:46 |
1037824 |
Logout |
Android |
Edge |
… |
false |
false |
false |
true |
true |
… |
|
2022/6/1 15:19 |
1049626 |
AddtoCart |
Windows |
Edge |
100004 |
… |
true |
true |
false |
true |
false |
… |
2022/6/1 16:00 |
1009296 |
Submit |
IOS |
Firefox |
100005 |
… |
false |
true |
false |
false |
true |
… |
2022/6/1 16:39 |
1070713 |
Browse |
IOS |
Sogou |
100006 |
… |
true |
true |
true |
false |
false |
… |
2022/6/1 17:40 |
1090884 |
Search |
Windows |
IE |
100007 |
… |
true |
false |
true |
true |
false |
… |
Fields in table T:
Field name |
Data type |
Description |
Time |
Datetime |
Time stamp of an event, accurate to milliseconds |
UserID |
String |
User ID |
EventType |
String |
Event type, whose value is Login, Browse, Search, AddtoCart, Submit or Logout |
OS |
String |
Operating system, whose value is Android, IOS, Windows or Unknown |
Browser |
String |
Browser, whose value is IE, Safari, Edge, Firefox, Chrome, Sogou or Unknown |
ProductID |
String |
Product ID, whose value is the ProductID field of dimension table Product |
… |
String |
Other fields that have enumerated values |
f1 |
Boolean |
Whether it is an offsite event or not; value is true or false |
f2 |
Boolean |
Whether it is a usual device or not; value is true or false |
f3 |
Boolean |
Whether it is a usual browser or not; value is true or false |
f4 |
Boolean |
Whether it is a cell phone or not; value is true or false |
f5 |
Boolean |
Whether it is the first operation; value is true or false |
… |
Boolean |
Other fields that have Boolean values |
In the previous article User Behavior Analysis in Practice 9: Enumerated Dimension and Tag Dimension, we convert the enumerated field and the tag field into corresponding ordinal numbers and bits respectively. Below is the structure and part of data of converted user events composite table T.ctx:
Time |
UserID |
EventType |
OS |
Browser |
ProductID |
… |
b1 |
… |
2022/6/1 10:20 |
1072755 |
3 |
1 |
1 |
100001 |
… |
36864 |
… |
2022/6/1 12:12 |
1078030 |
2 |
2 |
2 |
100002 |
… |
14336 |
… |
2022/6/1 12:36 |
1005093 |
5 |
1 |
5 |
100003 |
… |
57344 |
… |
2022/6/1 13:21 |
1048655 |
1 |
3 |
5 |
… |
14336 |
… |
|
2022/6/1 14:46 |
1037824 |
6 |
1 |
3 |
… |
6144 |
… |
|
2022/6/1 15:19 |
1049626 |
4 |
3 |
3 |
100004 |
… |
53248 |
… |
2022/6/1 16:00 |
1009296 |
5 |
2 |
4 |
100005 |
… |
18432 |
… |
2022/6/1 16:39 |
1070713 |
2 |
2 |
6 |
100006 |
… |
57344 |
… |
2022/6/1 17:40 |
1090884 |
3 |
3 |
1 |
100007 |
… |
45056 |
… |
Fields in converted table T:
Field name |
Data type |
Description |
Time |
Datetime |
Time stamp of an event, accurate to milliseconds |
UserID |
String |
User ID |
EventType |
Integer |
Event type, whose value is an ordinal number of the enumerated sequence |
OS |
Integer |
Operating system, whose value is an ordinal number of the enumerated sequence |
Browser |
Integer |
Browser, whose value is an ordinal number of the enumerated sequence |
ProductID |
String |
Product ID, whose value is the ProductID field of dimension table Product |
… |
… |
… |
b1 |
Integer |
Integer field that stores binary fields as bits; the first five bits correspond to whether it is offsite, whether it is the usual device, whether it is the usual browser, whether it is cell phone, and whether it is the first operation |
… |
… |
… |
Ordinal numbers in EventType field and their description:
1 Login
2 Browse
3 Search
4 AddtoCart
5 Submit
6 Logout
Ordinal numbers in OS field and their description:
1 Android
2 IOS
3 Windows
4 Unknown
Ordinal numbers in Browser field and their description:
1 IE
2 Safari
3 Edge
4 Firefox
5 Chrome
6 Sogou
7 Unknown
Dimension table Product:
ProductID |
ProductName |
Unit |
Price |
ProductType |
100001 |
Apple |
Pound |
5.5 |
Fruits |
100002 |
Tissue |
Packs |
16 |
Home&Personalcare |
100003 |
Beef |
Pound |
35 |
Meat |
100004 |
Wine |
Bottles |
120 |
Beverage |
100005 |
Pork |
Pound |
25 |
Meat |
100006 |
Bread |
Packs |
10 |
Bakery |
100007 |
Juice |
Bottles |
6 |
Beverage |
… |
… |
… |
… |
… |
Fields in dimension table Product:
Field name |
Data type |
Description |
ProductID |
String |
Product ID |
ProductName |
String |
Product name |
Unit |
String |
Sales unit |
Price |
Numeric |
Unit price |
ProductType |
Integer |
Product type |
Relationship between T and Product:
Computing task:
Find the number of occurrences of each type of event performed by users who are not newcomers on a local Android or IOS system using Safari, Edge or Chrome under the product type Home & Personal care in each month within a specified time period, and count the distinct users under same conditions.
Techniques involved:
Learn more about SPL pseudo tables in SPL Pseudo Table Data Type Optimization.
We can use the pseudo table to predefine computed columns and foreign keys, and simplify the representations of enumerated dimensions and tag dimensions:
1. Define a computed column Month to calculate the month in the time when the current event occurs.
2. Define pseudo fields EventTypeName, OSName and BrowserName to store correspondence relationships between EventType, OS and Browser’s ordinal numbers and their names.
3. Define the foreign key association on ProductID field between it and the corresponding dimension table.
4. Define field name represented by each bit in b1.
After a pseudo table is defined, we can use it to achieve data dump and summarization. This can significantly reduce the amount of code for summarization.
Sample code
1. Define pseudo table
A |
|
1 |
=T("Product.btx").keys@i(ProductID) |
2 |
=file("T.ctx").create(#Time,UserID,EventType,OS,Browser,ProductID,……,b1,……) |
3 |
=A2.close() |
4 |
=[{file:"T.ctx", column:[ {name:"Month",exp:"month@y(Time)"}, {name:"EventType",pseudo:"EventTypeName",enum:["Login","Browse","Search","AddtoCart","Submit","Logout"]}, {name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]}, {name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]}, {name:"b1",bits:["f1","f2","f3","f4","f5",…….]}, {name:"ProductID",dim:A1}] }] |
5 |
=pseudo(A4) |
A1 Import dimension table Product, and set primary key and index for it.
A2 Create composite table structure after enumerated fields are converted into ordinal numbers and binary fields are transformed to bits, and write it to the empty file T.ctx (database data will later be directly appended to the file through the pseudo table without writing the conversion code for ordinal numbers and bits).
A3 Close A2’s composite table.
A4 Define a pseudo table based on composite table T.ctx:
{name:"Month",exp:"month@y(Time)"}: The code defines a computed column named Month and evaluated through month@y(Time).
{name:"EventType", pseudo:"EventTypeName", enum:[ "Login", "Browse", "Search", "AddtoCart", "Submit", "Logout"]}: The code defines correspondence relationship between EventType field values and the enumerated sequence, and represents the matching name through pseudo field EventTypeName.
{name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]}: The code defines correspondence relationship between OS field values and the enumerated sequence, and represents the matching name through pseudo field OSName.
{name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]}: The code defines correspondence relationship between Browser field values and the enumerated sequence, and represents the matching name through pseudo field BrowserName.
{name:"b1",bits:["f1","f2","f3","f4","f5",…….]}: The code defines field name each bit in the bit-based dimension b1 represents.
{name:"ProductID",dim:A1}: The code defines association relationship on ProductID field between it and dimension table A1.
A5 Generate the pseudo table.
The pseudo table definition string can be saved for direct use in later computations. This can further reduce the amount of code.
2. Use pseudo table to dump data. SPL will automatically convert data into corresponding ordinal numbers and bits according to pseudo table definition, and store the converted data in table T.
A |
|
… |
/ The above code for defining pseudo table |
6 |
=connect("demo").cursor@x("select * from T order by Time") |
7 |
=A5.append@i(A6) |
A6 Connect to the database, and import data in table T to generate a cursor.
A7 Retrieve and append data in the cursor to the pseudo table.
3. Perform aggregation using the pseudo table
A |
|
… |
/ The above code for defining pseudo table |
6 |
>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
7 |
=A5.select(Time>=start && Time<=end && ProductID.ProductType=="Home&Personalcare"&& ["Safari","Edge","Chrome"].pos(BrowserName) && ["Android","IOS"].pos(OSName) && ! f1 && f4 && !f5).groups(Month,EventTypeName; count(1):Num, icount(UserID):iNum) |
A7 Perform aggregation using the pseudo table. We can use pseudo fields defined for the pseudo table as ordinary fields and treat the pseudo table as an ordinary, simple table without taking care of storage and computing mechanisms.
Execution result:
Month |
EventTypeName |
Num |
iNum |
202203 |
AddtoCart |
307603 |
29252 |
202203 |
Browse |
596492 |
58140 |
202203 |
Login |
672163 |
65569 |
202203 |
Logout |
672163 |
65569 |
202203 |
Search |
491317 |
42919 |
202203 |
Submit |
144552 |
13901 |
202204 |
AddtoCart |
615222 |
58484 |
202204 |
Browse |
1192970 |
116265 |
202204 |
Login |
1344323 |
131123 |
202204 |
Logout |
1344323 |
131123 |
202204 |
Search |
982637 |
85843 |
202204 |
Submit |
289112 |
27799 |
202205 |
AddtoCart |
615214 |
58484 |
202205 |
Browse |
1192976 |
116262 |
202205 |
Login |
1344339 |
131133 |
202205 |
Logout |
1344339 |
131133 |
202205 |
Search |
982633 |
85848 |
202205 |
Submit |
289108 |
27788 |
202206 |
AddtoCart |
307635 |
29256 |
202206 |
Browse |
596463 |
58124 |
202206 |
Login |
672175 |
65575 |
202206 |
Logout |
672175 |
65575 |
202206 |
Search |
491344 |
42929 |
202206 |
Submit |
144573 |
13887 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version