User Behavior Analysis in Practice 6: Numberizing the Dimension Table
Target task:
We have a user events table T. Below is its structure and part of its data:
Time |
UserID |
EventTypeID |
ProductID |
Quantity |
2022/6/1 10:20 |
1072755 |
3 |
100001 |
|
2022/6/1 12:12 |
1078030 |
2 |
100002 |
|
2022/6/1 12:36 |
1005093 |
5 |
100003 |
3 |
2022/6/1 13:21 |
1048655 |
1 |
||
2022/6/1 14:46 |
1037824 |
6 |
||
2022/6/1 15:19 |
1049626 |
4 |
100004 |
4 |
2022/6/1 16:00 |
1009296 |
5 |
100005 |
6 |
2022/6/1 16:39 |
1070713 |
2 |
100006 |
|
2022/6/1 17:40 |
1090884 |
3 |
100007 |
Fields in table T:
Field name |
Data type |
Description |
Time |
Datetime |
Time stamp of an event, accurate to milliseconds |
UserID |
String |
User ID |
EventTypeID |
Integer |
Event type ID |
ProductID |
String |
Product ID |
Quantity |
Numeric |
Quantity |
Dimension table EventType:
EventTypeID |
EventType |
1 |
Login |
2 |
Browse |
3 |
Search |
4 |
AddtoCart |
5 |
Submit |
6 |
Logout |
Dimension table Product:
ProductID |
ProductName |
Unit |
Price |
ProductTypeID |
100001 |
Apple |
Pound |
5.5 |
1 |
100002 |
Tissue |
Packs |
16 |
2 |
100003 |
Beef |
Pound |
35 |
3 |
100004 |
Wine |
Bottles |
120 |
4 |
100005 |
Pork |
Pound |
25 |
3 |
100006 |
Bread |
Packs |
10 |
5 |
100007 |
Juice |
Bottles |
6 |
4 |
… |
… |
… |
… |
… |
Fields in dimension table Product:
Field name |
Data type |
Description |
ProductID |
String |
Product ID |
ProductName |
String |
Product name |
Unit |
String |
Sales unit |
Price |
Numeric |
Unit prices |
ProductTypeID |
Integer |
Product type ID |
Dimension table ProductType:
ProductTypeID |
ProductType |
1 |
Fruits |
2 |
Home&Personalcare |
3 |
Meat |
4 |
Beverage |
5 |
Bakery |
… |
… |
Relationship between tables:
Computing task:
Calculate the total sales amount, number of orders, search frequency and the number of distinct users performing search and ordering under each type of product within a specified time period.
Techniques involved:
1. Associate with dimension table using ordinal-number-based location.
In both EventType and ProductType tables, primary keys are ordinal natural numbers. They can be directly used to locate records and achieve associations without creating index and computing and comparing HASH values. This can boost performance.
2. Convert a dimension table’s primary key values that are non-ordinal-numbers into ordinal natural numbers so that ordinal-number-based location can be used to speed up association.
Product table’s primary key values are not ordinal numbers represented by natural numbers, but we can first transform them into ordinal numbers and, in the meantime, convert ProductID field values in user events table T into corresponding ordinal number. Now ordinal-number-based location can be used.
Sample code
The code has five parts:
1. According to practices in previous essays, we dump data in those dimension tables as bin files EventType.btx, Product.btx and ProductType.btx that are respectively ordered by EventTypeID, ProductType and ProductTypeID.
2. Join user events table T and EventType table, Product table and ProductType table through ordinal numbers.
3. Add an ordinal number field to Product table.
4. Dump data from user events table T to store it in composite table T.ctx ordered by Time field, and transform ProductID values into ordinal numbers of corresponding records in Product table.
5. Import each dimension table into memory, open the composite table cursor, establish associations with dimension tables during which the association with Product table is through ordinal numbers. It is not necessary to set primary key and create index for dimension table Product.
1. Dump dimension table data
A |
|
1 |
=connect("demo") |
2 |
=A1.query("select * from Product") |
3 |
=file("Product.btx").export@b(A1) |
4 |
=A1.query("select * from EventType order by EventTypeID") |
5 |
=file("EventType.btx").export@b(A4) |
6 |
=A1.query@x("select * from ProductType order by ProductTypeID") |
7 |
=file("ProductType.btx").export@b(A6) |
A4 Sort by EventTypID.
A6 Sort by ProductTypID.
2. Change types of association between EventType table and table T, ProductType table and table T into ordinal number-based association.
A |
|
1 |
>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
2 |
=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3)) |
3 |
>EventType=file("EventType.btx").import@b() |
4 |
>ProductType=file("ProductType.btx").import@b() |
5 |
>Product=file("Product.btx").import@b().keys@i(ProductID) |
6 |
>Product=Product.switch(ProductTypeID, ProductType:#) |
7 |
=A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:#) |
8 |
=A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum) |
A3-A4 Do not set create indexes on primary keys for EventType and ProductType.
A6 Change type of association between ProductType table and table T into ordinal number-based association.
A7 Change type of association between EventType table and table T into ordinal number-based association.
3. Add an ordinal number field in Product table.
Original data: add an ordinal number field to it directly.
A |
|
1 |
=connect("demo").query@x("select * from Product").derive(#:ProductNum) |
2 |
= file("Product.btx").export@b(A1) |
Updated data: Whenever a dimension table is updated, it is wholly retrieved and compared with the dumped btx file. The comparison is based on ordinal numbers of the dumped records, otherwise historical data will be mismatched. The newly-increased data is placed at the end. Usually there are no deletion actions on dimension tables. Any deletion of dimension data will cause errors when historical records of the fact table try to reference records of the dimension table.
A |
|
1 |
.keys@i(ProductID)=connect("demo").query@x("select * from Product").derive(:ProductNum) |
2 |
= file("Product.btx").import@b().keys@i(ProductID) |
3 |
=A1.select(A2.find(A1.ProductID)==null) |
4 |
=A2.(if(r=A1.find(A2.ProductID),r,~) ) |
5 |
=(A4|A3).run(ProductNum=#) |
6 |
=file("Product.btx").export@b(A5) |
A1 Load the updated dimension table, add ProductNum field and set primary key and index.
A2 Load the original dimension table Product from the corresponding bin file and set primary key.
A3 Find the newly-increased records in the updated dimension table.
A4 If a record in the original dimension table exist in the updated one, use the new record; if it does not exist, use the original record.
A5 Union A4 and A3 and set ordinal numbers. As A4 keeps the order of the original dimension table, ordinal numbers in the original dimension table are retained.
A6 Write A5’s result to a bin file.
4. The code of preparing file T.ctx, during which ProductID field values are changed into ordinal numbers.
Take stocked data as an example:
A |
|
1 |
>Product=file("Product.btx").import@b().keys@i(ProductID) |
2 |
=connect("demo").cursor@x("select * from T order by Time") |
3 |
=A2.run(ProductID=Product.find(A2.ProductID).ProductNum) |
4 |
=file("T.ctx").create@y(#Time,UserID,EventTypeID, ProductID, Quantity) |
5 |
.append(A3)=A4 |
6 |
>A4.close() |
A1 Load dimension table Product into the memory and create index on primary key.
A2 Sort table T by time while retrieving data from it.
A3 Replace ProductID field values in table T with ordinal numbers of corresponding records in the dimension table.
A4 Create a composite table.
A5 Append data of table T to A4’s composite table.
Similar code for the newly-increased data.
5. Perform the whole analysis on the converted data, all through ordinal numbers.
Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:
A |
|
1 |
>EventType=file("EventType.btx").import@b() |
2 |
>ProductType=file("ProductType.btx").import@b() |
3 |
>Product=file("Product.btx").import@b() |
4 |
>Product=Product.switch(ProductTypeID, ProductType:#) |
5 |
=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3)) |
6 |
=A5.switch(ProductID,Product:#; EventTypeID,EventType:#) |
7 |
=A6.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum) |
A1-A3 Do not set primary keys.
A6 A join via ordinal numbers.
Execution result:
EventTypeID |
ProductTypeID |
EventType |
ProductType |
Quantity |
Num |
iNum |
3 |
1 |
Search |
Fruits |
0 |
499586 |
48735 |
3 |
2 |
Search |
Home&Personalcare |
0 |
508897 |
49872 |
3 |
3 |
Search |
Meat |
0 |
403213 |
39923 |
3 |
4 |
Search |
Beverage |
0 |
324567 |
29045 |
3 |
5 |
Search |
Bakery |
0 |
335498 |
30234 |
… |
… |
… |
… |
… |
… |
… |
5 |
1 |
Submit |
Fruits |
206938 |
103469 |
13523 |
5 |
2 |
Submit |
Home&Personalcare |
463188 |
154396 |
14656 |
5 |
3 |
Submit |
Meat |
94378 |
93366 |
8754 |
5 |
4 |
Submit |
Beverage |
217504 |
54376 |
5233 |
5 |
5 |
Submit |
Bakery |
339480 |
67896 |
5844 |
… |
… |
… |
… |
… |
… |
… |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL