
pjf RaqForum 22 No.
1 Reply • 496 View • 2 Years ago

User Behavior Analysis in Practice 6: Numberizing the Dimension Table

Target task:

We have a user events table T. Below is its structure and part of its data:

Time	UserID	EventTypeID	ProductID	Quantity
2022/6/1 10:20	1072755	3	100001
2022/6/1 12:12	1078030	2	100002
2022/6/1 12:36	1005093	5	100003	3
2022/6/1 13:21	1048655	1
2022/6/1 14:46	1037824	6
2022/6/1 15:19	1049626	4	100004	4
2022/6/1 16:00	1009296	5	100005	6
2022/6/1 16:39	1070713	2	100006
2022/6/1 17:40	1090884	3	100007

Fields in table T:

Field name	Data type	Description
Time	Datetime	Time stamp of an event, accurate to milliseconds
UserID	String	User ID
EventTypeID	Integer	Event type ID
ProductID	String	Product ID
Quantity	Numeric	Quantity

Dimension table EventType:

EventTypeID	EventType
1	Login
2	Browse
3	Search
4	AddtoCart
5	Submit
6	Logout

Dimension table Product:

ProductID	ProductName	Unit	Price	ProductTypeID
100001	Apple	Pound	5.5	1
100002	Tissue	Packs	16	2
100003	Beef	Pound	35	3
100004	Wine	Bottles	120	4
100005	Pork	Pound	25	3
100006	Bread	Packs	10	5
100007	Juice	Bottles	6	4
…	…	…	…	…

Fields in dimension table Product:

Field name	Data type	Description
ProductID	String	Product ID
ProductName	String	Product name
Unit	String	Sales unit
Price	Numeric	Unit price
ProductTypeID	Integer	Product type ID

Dimension table ProductType:

ProductTypeID	ProductType
1	Fruits
2	Home&Personalcare
3	Meat
4	Beverage
5	Bakery
…	…

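Relationship between tables:

Table T is associated with EventType through EventTypeID and with Product through ProductID; Product is associated with ProductType through ProductTypeID.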

Computing task:

For each product type within a specified time period, calculate the total sales amount, the number of orders, the number of searches, and the number of distinct users who performed each of these two actions (searching and ordering).

Techniques involved:

1. Associate with dimension table using ordinal-number-based location.

In both the EventType and ProductType tables, the primary keys are consecutive natural numbers, so a key value is also the position of its record in the table. Records can therefore be located directly by position, without creating an index or computing and comparing hash values, which speeds up the association.
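To illustrate why ordinal keys avoid hashing, here is a minimal sketch in plain Python (not SPL). The list `event_types` and both helper functions are illustrative assumptions, not part of the original code. Because the IDs run 1, 2, 3, … with no gaps and the table is stored in that order, the record for ID k sits at position k-1, so a lookup is a single index operation.

```python
# Dimension table EventType, stored in EventTypeID order (1..6, no gaps).
event_types = ["Login", "Browse", "Search", "AddtoCart", "Submit", "Logout"]

def locate_by_ordinal(event_type_id: int) -> str:
    """Ordinal-based location: ID k lives at position k-1; no hashing involved."""
    return event_types[event_type_id - 1]

# The conventional alternative: a hash index keyed on EventTypeID.
hash_index = {i + 1: name for i, name in enumerate(event_types)}

def locate_by_hash(event_type_id: int) -> str:
    """Hash-based location: hash the key, probe the table, compare keys."""
    return hash_index[event_type_id]

# Both strategies return the same record.
for k in (3, 5):
    assert locate_by_ordinal(k) == locate_by_hash(k)
```

The positional version only works while the IDs stay consecutive and the table stays in ID order, which is why the dimension tables are dumped sorted by their keys.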

2. Convert a dimension table’s primary key values that are non-ordinal-numbers into ordinal natural numbers so that ordinal-number-based location can be used to speed up association.

The Product table's primary key values are not ordinal numbers. We can assign each product record an ordinal number and, at the same time, replace the ProductID values in user events table T with the ordinal numbers of the corresponding product records. Ordinal-number-based location can then be used for this association as well.
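The conversion described above can be sketched in plain Python (not SPL). The sample rows and the names `product_num`, `converted` and `product_name` are assumptions made for illustration. The key-to-ordinal mapping is needed only once, while converting the fact data; afterwards every join is a positional read.

```python
# Product rows as (ProductID, ProductName), in the order they are dumped.
products = [("100001", "Apple"), ("100002", "Tissue"), ("100003", "Beef")]

# Assign each product its 1-based position as an ordinal number.
product_num = {pid: i + 1 for i, (pid, _) in enumerate(products)}

# Fact rows as (UserID, EventTypeID, ProductID); ProductID may be missing.
events = [("1072755", 3, "100001"), ("1048655", 1, None), ("1005093", 5, "100003")]

# One-off conversion: replace each ProductID with the product's ordinal number.
converted = [(uid, etype, product_num.get(pid)) for uid, etype, pid in events]

def product_name(num):
    """After conversion the join is a positional read; None means no product."""
    return products[num - 1][1] if num is not None else None

names = [product_name(num) for _, _, num in converted]
```

Events without a product, such as Login, simply carry no ordinal number after the conversion.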

Sample code

The code has five parts:

1. Following the practice of the previous essays, dump the dimension tables as bin files EventType.btx, Product.btx and ProductType.btx. EventType.btx is ordered by EventTypeID and ProductType.btx by ProductTypeID.

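2. Associate user events table T with the EventType table, and the Product table with the ProductType table, through ordinal numbers. At this stage table T is still associated with the Product table through the primary key.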

3. Add an ordinal number field to Product table.

4. Dump data from user events table T to store it in composite table T.ctx ordered by Time field, and transform ProductID values into ordinal numbers of corresponding records in Product table.

5. Load each dimension table into memory, open a cursor on the composite table, and associate it with the dimension tables. The association with the Product table is now through ordinal numbers, so there is no need to set a primary key or create an index for the Product table.

1. Dump dimension table data

	A
1	=connect("demo")
2	=A1.query("select * from Product")
3	=file("Product.btx").export@b(A2)
4	=A1.query("select * from EventType order by EventTypeID")
5	=file("EventType.btx").export@b(A4)
6	=A1.query@x("select * from ProductType order by ProductTypeID")
7	=file("ProductType.btx").export@b(A6)

A4 Sort by EventTypeID.

A6 Sort by ProductTypeID.

2. Change the association between table T and the EventType table, and between the Product table and the ProductType table, into ordinal-number-based association.

	A
1	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
2	=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))
3	>EventType=file("EventType.btx").import@b()
4	>ProductType=file("ProductType.btx").import@b()
5	>Product=file("Product.btx").import@b().keys@i(ProductID)
6	>Product=Product.switch(ProductTypeID, ProductType:#)
7	=A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:#)
8	=A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A3-A4 Do not set primary keys or create indexes for EventType and ProductType.

A6 Change the association between the Product table and the ProductType table into an ordinal-number-based association.

A7 Associate table T with the Product table through the primary key, and change the association between table T and the EventType table into an ordinal-number-based association.

3. Add an ordinal number field in Product table.

Original data: add an ordinal number field to it directly.

	A
1	=connect("demo").query@x("select * from Product").derive(#:ProductNum)
2	= file("Product.btx").export@b(A1)

Updated data: whenever the dimension table is updated, retrieve it in full and compare it with the dumped btx file. Records that already exist in the btx file must keep their ordinal numbers; otherwise historical fact data will be matched to the wrong records. Newly added records are placed at the end. Dimension table records are normally never deleted, because deleting one causes errors when historical records of the fact table try to reference it.

	A
1	=connect("demo").query@x("select * from Product").derive(:ProductNum).keys@i(ProductID)
2	= file("Product.btx").import@b().keys@i(ProductID)
3	=A1.select(A2.find(A1.ProductID)==null)
4	=A2.(if(r=A1.find(A2.ProductID),r,~) )
5	=(A4|A3).run(ProductNum=#)
6	=file("Product.btx").export@b(A5)

A1 Load the updated dimension table, add ProductNum field and set primary key and index.

A2 Load the original dimension table Product from the corresponding bin file and set primary key.

A3 Find the newly added records in the updated dimension table.

A4 If a record in the original dimension table exists in the updated one, use the new record; if it does not exist, use the original record.

A5 Concatenate A4 and A3 and set ordinal numbers by position. As A4 keeps the order of the original dimension table, the ordinal numbers of the original records are retained.

A6 Write A5’s result to a bin file.
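The merge logic of A3 to A5 can be sketched in plain Python (not SPL) as follows. The function name `merge_dimension`, the dict-based rows and the sample prices are assumptions made for illustration only. The point to check is that existing records keep their positions, and therefore their ordinal numbers, while new records are appended.

```python
def merge_dimension(old_rows, new_rows):
    """Merge a refreshed dimension table into the dumped one, preserving ordinals.

    Rows are dicts with at least a "ProductID" key; the result has "ProductNum" set.
    """
    new_by_id = {r["ProductID"]: r for r in new_rows}
    old_ids = {r["ProductID"] for r in old_rows}

    # Keep the old order; take the refreshed version of a row when one exists.
    merged = [dict(new_by_id.get(r["ProductID"], r)) for r in old_rows]
    # Brand-new products go to the end, in the order they were retrieved.
    merged += [dict(r) for r in new_rows if r["ProductID"] not in old_ids]

    # Renumber by position; existing rows end up with the ordinals they already had.
    for i, row in enumerate(merged, start=1):
        row["ProductNum"] = i
    return merged

old = [
    {"ProductID": "100001", "Price": 5.5, "ProductNum": 1},
    {"ProductID": "100002", "Price": 16, "ProductNum": 2},
]
new = [
    {"ProductID": "100003", "Price": 35},    # new product
    {"ProductID": "100002", "Price": 18},    # price changed
    {"ProductID": "100001", "Price": 5.5},
]
result = merge_dimension(old, new)
```

Even though the refreshed table arrives in a different order, products 100001 and 100002 keep ordinal numbers 1 and 2, so fact records converted earlier still point at the right rows.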

4. The code of preparing file T.ctx, during which ProductID field values are changed into ordinal numbers.

Take the existing historical data as an example:

	A
1	>Product=file("Product.btx").import@b().keys@i(ProductID)
2	=connect("demo").cursor@x("select * from T order by Time")
3	=A2.run(ProductID=Product.find(A2.ProductID).ProductNum)
4	=file("T.ctx").create@y(#Time,UserID,EventTypeID, ProductID, Quantity)
5	=A4.append(A3)
6	>A4.close()

A1 Load dimension table Product into the memory and create index on primary key.

A2 Sort table T by time while retrieving data from it.

A3 Replace ProductID field values in table T with ordinal numbers of corresponding records in the dimension table.

A4 Create a composite table.

A5 Append data of table T to A4’s composite table.

The code for newly added data is similar.

5. Perform the whole analysis on the converted data, all through ordinal numbers.

Suppose we need to summarize data that falls between 2022-03-15 and 2022-06-16 (the variables start and end are assumed to hold these two dates, set as in A1 of part 2):

	A
1	>EventType=file("EventType.btx").import@b()
2	>ProductType=file("ProductType.btx").import@b()
3	>Product=file("Product.btx").import@b()
4	>Product=Product.switch(ProductTypeID, ProductType:#)
5	=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))
6	=A5.switch(ProductID,Product:#; EventTypeID,EventType:#)
7	=A6.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A1-A3 Do not set primary keys.

A6 A join via ordinal numbers.
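As a rough illustration of what A4 to A7 compute, the following plain-Python sketch (not SPL) performs the same grouping with every join done by position. All sample rows, user IDs and variable names are assumptions made for illustration, and the time filter is omitted for brevity.

```python
from collections import defaultdict

# Dimension tables held as lists; position k-1 holds the record with ordinal k.
event_type = ["Login", "Browse", "Search", "AddtoCart", "Submit", "Logout"]
product_type = ["Fruits", "Home&Personalcare", "Meat", "Beverage", "Bakery"]
# Product rows in ProductNum order: (ProductName, ProductTypeID).
product = [("Apple", 1), ("Tissue", 2), ("Beef", 3), ("Wine", 4), ("Pork", 3)]

# Converted fact rows: (UserID, EventTypeID, ProductNum, Quantity).
events = [
    ("u1", 3, 1, None),
    ("u2", 5, 3, 3),
    ("u1", 5, 5, 6),
    ("u1", 3, 1, None),
    ("u3", 2, 2, None),   # Browse: filtered out below
]

groups = defaultdict(lambda: {"Quantity": 0, "Num": 0, "users": set()})
for user, etype, pnum, qty in events:
    if etype not in (3, 5):            # keep Search and Submit only
        continue
    ptype = product[pnum - 1][1]       # positional join to Product
    g = groups[(etype, ptype)]
    g["Quantity"] += qty or 0          # sum(Quantity)
    g["Num"] += 1                      # count(1)
    g["users"].add(user)               # icount(UserID)

result = [
    (et, pt, event_type[et - 1], product_type[pt - 1],
     g["Quantity"], g["Num"], len(g["users"]))
    for (et, pt), g in sorted(groups.items())
]
```

No dictionary or index on any dimension table is involved; each association is a list index, which is the effect the ordinal-number-based switch aims for.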

Execution result:

EventTypeID	ProductTypeID	EventType	ProductType	Quantity	Num	iNum
3	1	Search	Fruits	0	499586	48735
3	2	Search	Home&Personalcare	0	508897	49872
3	3	Search	Meat	0	403213	39923
3	4	Search	Beverage	0	324567	29045
3	5	Search	Bakery	0	335498	30234
…	…	…	…	…	…	…
5	1	Submit	Fruits	206938	103469	13523
5	2	Submit	Home&Personalcare	463188	154396	14656
5	3	Submit	Meat	94378	93366	8754
5	4	Submit	Beverage	217504	54376	5233
5	5	Submit	Bakery	339480	67896	5844
…	…	…	…	…	…	…

