"Target task We have a user events table T. Below is its structure and part of its data: Time Use .."

Hoo RaqForum 19 No.
1 Reply • 372 View • 2 Years ago

User Behavior Analysis in Practice 7: Dimension Table Filtering

Target task

We have a user events table T. Below is its structure and part of its data:

Time	UserID	ProductID	Quantity	…
2022/6/1 10:20	1072755	1	7	…
2022/6/1 12:12	1078030	2	8	…
2022/6/1 12:36	1005093	3	3	…
2022/6/1 13:21	1048655	4	9	…
2022/6/1 14:46	1037824	5	5	…
2022/6/1 15:19	1049626	6	4	…
2022/6/1 16:00	1009296	7	6	…
2022/6/1 16:39	1070713	8	7	…
2022/6/1 17:40	1090884	9	4	…

Fields in table T:

Field name	Data type	Description
Time	Datetime	Time stamp of an event, accurate to milliseconds
UserID	String	User ID
ProductID	Integer	Product ID
Quantity	Numeric	Quantity

Dimension table Product:

ProductID	ProductName	Unit	Price	ProductTypeID
1	Apple	Pound	5.5	1
2	Tissue	Packs	16	2
3	Beef	Pound	35	3
4	Wine	Bottles	120	4
5	Pork	Pound	25	3
6	Bread	Packs	10	5
7	Juice	Bottles	6	4
…	…	…	…	…

Fields of dimension table Product:

Field name	Data type	Description
ProductID	String	Product ID
ProductName	String	Product name
Unit	String	Sales unit
Price	Numeric	Unit price
ProductTypeID	Integer	Product type ID

Computing task:

Find the number of sold pieces for each product under type IDs 1, 2 and 3 within a specified time period.

Techniques involved:

1. Filtering dimension table before join

We first filter the dimension table, and then join it with the fact table. Thereisnoneedtojudgethespecifiedconditionondimensiontableforfacttablerecordsthatdonotmatchthefiltereddimensiontable. In this way the number of comparisons will decrease considerably.

2. Performing join in cursor

Join the fact table’s cursor with the filtered dimension table, during which fact table records that cannot associate with the dimension table will not be generated. This further reduces time spent in generating records in the joining result set.

3. Index reuse

Use the existing index on the filtered dimension table. When the filtering result set is still large, which means only a small number of records are filtered away, the technique saves your time for re-creating the index.

4. Alignment sequence

We can use an efficient alignment sequence to filter a numberized dimension table by generating a same-length table sequence, whose member values are whether the corresponding dimension records meet the filtering condition. When joining the dimension table with the fact table, we can directly find whether a fact table record matches or not according to its ordinal number. This saves time for performing the join and comparisons.

Sample code

Suppose T.ctx and Product.btx are already created as above introduces. T.ctx is ordered by Time and Product.btx is ordered by ProductID.

1. Filter dimension table Product and join it with T; delete records from table T that do not have matching records in the dimension table during joining.

	A
1	>Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)
2	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3	=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)
4	=A3.switch@i(ProductID,Product:ProductID)
5	=A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import the dimension table and filter it, and set index on the primary key.

A4 Join the fact table and the dimension table; @i option enables deleting records from the fact table that do not match the dimension table.

2. Move the association action in step 1 to the cursor generation statement, during which fact table records that cannot match won’t be generated in the cursor.

	A
1	>Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)
2	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3	=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:Product)
4	=A3.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A3 Move the action of association between the fact table and the dimension table to the cursor generation statement; fact table records that are not matched won’t be generated in the result set.

3. Use the existing index on filtered dimension table Product

	A
1	>Product=file("Product.btx").import@b().keys@i(ProductID)
2	=Product.select@i([1,2,3].pos(ProductTypeID)!=null)
3	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4	=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:A2)
5	=A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product and set index on primary key.

A2 @i option enables reusing the existing index on the filtered dimension table.

Index reuse does not always make the computation faster. As records that are filtered away need to be deleted from the index table, it takes some time to perform the delete action when there are a lot of records are discarded (the number of desired records is small). In this case re-creating the index may be faster. It is important that we choose an appropriate method according to the actual situation.

4. Use alignment sequence as primary key values of dimension table Product are ordinal numbers

	A
1	>Product=file("Product.btx").import@b()
2	Product.([1,2,3].pos(ProductTypeID))=
3	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4	=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)
5	=A4.select(A2(ProductID))
6	=A5.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product without setting index on the primary key.

A2 Generate an alignment sequence according to Product table. Each of its values is whether the current record meets the filtering condition or not.

A5 Directly locate the target member in A2 according to table T’s ProductID to find out whether the current record satisfies the filtering condition.

5. Move the alignment-sequence-based filtering action to the cursor generation statement

	A
1	>Product=file("Product.btx").import@b()
2	Product.([1,2,3].pos(ProductTypeID))=
3	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4	=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end && A2(ProductID))
5	=A4.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A4 Move the alignment-sequence-based filtering action to the cursor generation statement. Records that do not meet the filtering condition won’t be generated.

Execution result:

ProductID	ProductName	Quantity
1	Apple	206938
2	Tissue	463188
3	Beef	94378
5	Pork	217504
…	…	…

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/cFTcUNs7

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Hoo • 372 View • 2 Years ago

User Behavior Analysis in Practice 7: Dimension Table Filtering

Target task

Techniques involved:

Sample code

ToC