User Behavior Analysis in Practice 8: The Changing Dimension Table
Target task
We have a user events table T. Below is its structure and part of its data:
Time |
UserID |
ProductID |
Quantity |
… |
2022/6/1 10:20 |
1072755 |
1 |
7 |
… |
2022/6/1 12:12 |
1078030 |
2 |
8 |
… |
2022/6/1 12:36 |
1005093 |
3 |
3 |
… |
2022/6/1 13:21 |
1048655 |
4 |
9 |
… |
2022/6/1 14:46 |
1037824 |
5 |
5 |
… |
2022/6/1 15:19 |
1049626 |
6 |
4 |
… |
2022/6/1 16:00 |
1009296 |
7 |
6 |
… |
2022/6/1 16:39 |
1070713 |
8 |
7 |
… |
2022/6/1 17:40 |
1090884 |
9 |
4 |
… |
Fields in table T:
Field name |
Data type |
Description |
Time |
Datetime |
Time stamp of an event, accurate to milliseconds |
UserID |
String |
User ID |
ProductID |
Integer |
Product ID |
Quantity |
Numeric |
Quantity |
Dimension table Product:
ProductID |
ProductName |
Unit |
Origin |
ProductType |
1 |
Apple |
Pound |
Shandong |
Fruits |
2 |
Tissue |
Packs |
Guangdong |
Home&Personalcare |
3 |
Beef |
Pound |
Qingdao |
Meat |
4 |
Wine |
Bottles |
Shanxi |
Beverage |
5 |
Pork |
Pound |
Xizang |
Meat |
6 |
Bread |
Packs |
Beijing |
Bakery |
7 |
Juice |
Bottles |
Xinjiang |
Beverage |
… |
… |
… |
… |
… |
Fields of dimension table Product:
Field name |
Data type |
Description |
ProductID |
String |
Product ID |
ProductName |
String |
Product name |
Unit |
String |
Unit |
ProductTypeID |
String |
Product type |
Origin |
Numeric |
Product origin |
Computing task:
Find the sales quantity of each type of product of each place of origin within the specified time period.
One thing we should take into consideration is that the place of origin is not fixed, and get sales quantity of a type of product according to the place of origin in the current transaction.
Techniques involved:
Employing time key on the dimension table. You can find more about SPL time key mechanism in Why Are There So Many Snapshot Tables in BI Systems?.
1. Generate a dimension table with time key according to the dimension table in the production system at regular time interval.
Add a time key field to Product.btx. Here’s the dimension table’s new structure:
Field name |
Data type |
Description |
eTime |
Datetime |
Effective time for the current record |
ProductID |
String |
Product ID |
ProductName |
String |
Product name |
Unit |
String |
Sales unit |
ProductTypeID |
String |
Product type |
Origin |
Numeric |
Place of origin |
Perform daily data dump on Product.btx with a time key according to the updated information in the production system’s Product table every day.
2. Summarize data using the dimension table with a time key
Define a new composite primary key using eTime and ProductID for the dimension table while specifying eTime as the time key. Table T is associated with Product through Time and ProductID fields. Programmers do not need to take care of how to achieve the time key because SPL is designed to automatically handle the related computing logic. They just use the dimension table as a regular one.
Sample code
1. Dump dimension table Product.btx.
For original data: Directly add the time key field.
A |
|
1 |
=connect("demo").query@x("select * from Product").derive(now():eTime) |
2 |
= file("Product.btx").export@b(A1) |
When data is updated: Import the new dimension table, get the record with the latest time under each type of product from the dumped btx file with the time key, compare them with the new table, and append a new record to the btx file for the updated data and record a new effective time.
A |
|
1 |
=connect("demo").cursor@x("select * from Product") |
2 |
= T("Product.btx").keys@it(ProductID,eTime) |
3 |
=A1.select(cmp(~.array(),A2.find(A1.ProductID).array().to(2,))!=0) |
4 |
= A3.fetch().derive(now():eTime) |
5 |
=file("Product.btx").export@ab(A4) |
A1 Import the new dimension table and set index on the primary key.
A2 Import the old dimension table Product from the btx file, get record having the latest time for each product, and set index on the primary key.
@t option means the last key field is the time key.
A3 Get records under each primary key value from the new dimension table where the latest (largest) eTime is different from that in the records under same key value in the old dimension table. The find function will automatically select the record having the latest time corresponding to same key value.
A4 Add effective time field to A3’s records.
A5 Append and export A4’s records to the bin file.
2. Join the fact table and the dimension table and perform aggregation
Suppose T.ctx is already generated as the above explains and sorted by Time:
A |
|
1 |
>Product=T("Product.btx").keys@ti(ProductID,eTime) |
2 |
>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
3 |
=file("T.ctx").open().cursor(ProductID,Quantity,Time;Time>=start && Time<=end) |
4 |
=A3.switch(ProductID:Time,Product) |
5 |
=A4.groups(ProductID.ProductType,ProductID.Origin; sum(Quantity):Quantity) |
A1 Import dimension table Product, and set index on the primary key while specifying the time key.
A4 Join the fact table and the dimension table using the familiar SPL join syntax. As the joining fields contains a time field, the operation will find the record having the latest time (the largest time that is smaller than eTime) before the current time field value.
Execution result:
ProductType |
Origin |
Quantity |
Fruits |
Shandong |
1241628 |
Fruits |
Xinjiang |
546357 |
Fruits |
Hainan |
24526 |
Home&Personalcare |
Guangdong |
7411008 |
Meat |
Qingdao |
3303230 |
Meat |
Neimeng |
657546 |
Meat |
Xizang |
2456235 |
Bakery |
Beijing |
247673 |
Beverage |
Xinjiang |
3526574 |
Beverage |
Shanxi |
6090112 |
… |
… |
… |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version