"Target task We have a user events table T. Below is its structure and part of its data: Time Use .."

Hoo RaqForum 19 No.
1 Reply • 434 View • 2 Years ago

User Behavior Analysis in Practice 8: The Changing Dimension Table

Target task

We have a user events table T. Below is its structure and part of its data:

Time	UserID	ProductID	Quantity	…
2022/6/1 10:20	1072755	1	7	…
2022/6/1 12:12	1078030	2	8	…
2022/6/1 12:36	1005093	3	3	…
2022/6/1 13:21	1048655	4	9	…
2022/6/1 14:46	1037824	5	5	…
2022/6/1 15:19	1049626	6	4	…
2022/6/1 16:00	1009296	7	6	…
2022/6/1 16:39	1070713	8	7	…
2022/6/1 17:40	1090884	9	4	…

Fields in table T:

Field name	Data type	Description
Time	Datetime	Time stamp of an event, accurate to milliseconds
UserID	String	User ID
ProductID	Integer	Product ID
Quantity	Numeric	Quantity

Dimension table Product:

ProductID	ProductName	Unit	Origin	ProductType
1	Apple	Pound	Shandong	Fruits
2	Tissue	Packs	Guangdong	Home&Personalcare
3	Beef	Pound	Qingdao	Meat
4	Wine	Bottles	Shanxi	Beverage
5	Pork	Pound	Xizang	Meat
6	Bread	Packs	Beijing	Bakery
7	Juice	Bottles	Xinjiang	Beverage
…	…	…	…	…

Fields of dimension table Product:

Field name	Data type	Description
ProductID	String	Product ID
ProductName	String	Product name
Unit	String	Unit
ProductTypeID	String	Product type
Origin	Numeric	Product origin

Computing task:

Find the sales quantity of each type of product of each place of origin within the specified time period.

One thing we should take into consideration is that the place of origin is not fixed, and get sales quantity of a type of product according to the place of origin in the current transaction.

Techniques involved:

Employing time key on the dimension table. You can find more about SPL time key mechanism in Why Are There So Many Snapshot Tables in BI Systems?.

1. Generate a dimension table with time key according to the dimension table in the production system at regular time interval.

Add a time key field to Product.btx. Here’s the dimension table’s new structure:

Field name	Data type	Description
eTime	Datetime	Effective time for the current record
ProductID	String	Product ID
ProductName	String	Product name
Unit	String	Sales unit
ProductTypeID	String	Product type
Origin	Numeric	Place of origin

Perform daily data dump on Product.btx with a time key according to the updated information in the production system’s Product table every day.

2. Summarize data using the dimension table with a time key

Define a new composite primary key using eTime and ProductID for the dimension table while specifying eTime as the time key. Table T is associated with Product through Time and ProductID fields. Programmers do not need to take care of how to achieve the time key because SPL is designed to automatically handle the related computing logic. They just use the dimension table as a regular one.

Sample code

1. Dump dimension table Product.btx.

For original data: Directly add the time key field.

	A
1	=connect("demo").query@x("select * from Product").derive(now():eTime)
2	= file("Product.btx").export@b(A1)

When data is updated: Import the new dimension table, get the record with the latest time under each type of product from the dumped btx file with the time key, compare them with the new table, and append a new record to the btx file for the updated data and record a new effective time.

	A
1	=connect("demo").cursor@x("select * from Product")
2	= T("Product.btx").keys@it(ProductID,eTime)
3	=A1.select(cmp(~.array(),A2.find(A1.ProductID).array().to(2,))!=0)
4	= A3.fetch().derive(now():eTime)
5	=file("Product.btx").export@ab(A4)

A1 Import the new dimension table and set index on the primary key.

A2 Import the old dimension table Product from the btx file, get record having the latest time for each product, and set index on the primary key.

@t option means the last key field is the time key.

A3 Get records under each primary key value from the new dimension table where the latest (largest) eTime is different from that in the records under same key value in the old dimension table. The find function will automatically select the record having the latest time corresponding to same key value.

A4 Add effective time field to A3’s records.

A5 Append and export A4’s records to the bin file.

2. Join the fact table and the dimension table and perform aggregation

Suppose T.ctx is already generated as the above explains and sorted by Time:

	A
1	>Product=T("Product.btx").keys@ti(ProductID,eTime)
2	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3	=file("T.ctx").open().cursor(ProductID,Quantity,Time;Time>=start && Time<=end)
4	=A3.switch(ProductID:Time,Product)
5	=A4.groups(ProductID.ProductType,ProductID.Origin; sum(Quantity):Quantity)

A1 Import dimension table Product, and set index on the primary key while specifying the time key.

A4 Join the fact table and the dimension table using the familiar SPL join syntax. As the joining fields contains a time field, the operation will find the record having the latest time (the largest time that is smaller than eTime) before the current time field value.

Execution result:

ProductType	Origin	Quantity
Fruits	Shandong	1241628
Fruits	Xinjiang	546357
Fruits	Hainan	24526
Home&Personalcare	Guangdong	7411008
Meat	Qingdao	3303230
Meat	Neimeng	657546
Meat	Xizang	2456235
Bakery	Beijing	247673
Beverage	Xinjiang	3526574
Beverage	Shanxi	6090112
…	…	…

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/cFTcUNs7

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Hoo • 434 View • 2 Years ago

User Behavior Analysis in Practice 8: The Changing Dimension Table

Target task

Techniques involved:

Sample code

ToC