User Behavior Analysis in Practice 4: Using Column-wise Storage
Target task:
We have a user events table T. Below is its structure and part of its data:
| Time | UserID | EventTypeID | EventType | Product | Quantity | Unit | Price | … |
|---|---|---|---|---|---|---|---|---|
| 2022/6/1 10:20 | 1072755 | 3 | Search | Apple | 5 | Pound | 5.5 | … |
| 2022/6/1 12:12 | 1078030 | 2 | Browse | Tissue | 4 | Packs | 16 | … |
| 2022/6/1 12:36 | 1005093 | 5 | Submit | Beef | 3 | Pound | 35 | … |
| 2022/6/1 13:21 | 1048655 | 1 | Login | | | | | … |
| 2022/6/1 14:46 | 1037824 | 6 | Logout | | | | | … |
| 2022/6/1 15:19 | 1049626 | 4 | AddtoCart | Wine | 4 | Bottles | 120 | … |
| 2022/6/1 16:00 | 1009296 | 5 | Submit | Pork | 6 | Pound | 25 | … |
| 2022/6/1 16:39 | 1070713 | 2 | Browse | Bread | 3 | Packs | 10 | … |
| 2022/6/1 17:40 | 1090884 | 3 | Search | Juice | 6 | Bottles | 6 | … |
Fields in table T:
| Field name | Data type | Description |
|---|---|---|
| Time | Datetime | Timestamp of an event, accurate to milliseconds |
| UserID | Integer | User ID |
| EventTypeID | Integer | Event type ID |
| EventType | String | Event type name |
| Product | String | Product name |
| Quantity | Numeric | Quantity |
| Unit | String | Unit |
| Price | Numeric | Price |
| …… | …… | …… |
Computing task:
For each event type, find the number of events and the number of distinct users who perform that type of event within a specified time period.
The first thing the task involves is retrieving the data that falls within the specified time period from a huge volume of data.
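Expressed as a minimal SPL sketch over in-memory data, the target computation looks as follows (assume, for illustration only, that the events are already loaded into a table sequence named T and that start and end hold the period boundaries):

| | A |
|---|---|
| 1 | =T.select(Time>=start && Time<=end).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |

The rest of this article is about making this computation fast when the data is far too large to load into memory.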
Techniques involved:
1. Column-wise storage
The user events table has many fields, but the analysis involves only a few of them. With a composite table in column-wise storage format, we only need to read the fields involved, which significantly speeds up retrieval and reduces memory usage.
Generally, a column-wise stored table also compresses well, which further reduces disk access.
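As a quick illustration, reading only the fields the analysis needs from a composite table file looks like this (a sketch; T.ctx is the file created in the Sample code section below):

| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor(UserID,EventTypeID,EventType) |

Only the three listed columns are scanned; the many other columns of the table are never read from disk.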
2. Order-based storage
This computing task retrieves and summarizes data within a specified time range. Using Time as the dimension of the composite table and storing data in its order greatly increases the filtering speed.
A composite table stores data in blocks, and each block records the maximum and minimum values of its dimension field (which we call the minmax index). A filtering condition on the dimension field can therefore quickly identify and skip the ineligible blocks.
Order-based storage also increases the compression ratio further and thus reduces the amount of data read from disk.
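A sketch of a filter on the ordered dimension (start and end are assumed to hold the time boundaries):

| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor(UserID,EventTypeID;Time>=start && Time<=end) |

Since Time is the dimension, any block whose [min, max] Time range lies entirely outside [start, end] can be skipped via the minmax index without being scanned.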
3. Pre-filtering before retrieval with cursor
Generating data objects takes time. By evaluating the filtering condition before records are retrieved from the cursor, we can avoid generating objects for ineligible data (a comparison with post-retrieval filtering is shown under Sample code section 2 below).
Sample code
1. Dump the data from the database and store it in a composite table file, column-wise and ordered by time.
Stocked data: retrieve the historical data from the database, sort it by time and write it to a composite table file in column-wise format.
| | A |
|---|---|
| 1 | =connect("demo").cursor@x("select * from T order by Time") |
| 2 | =file("T.ctx").create@y(#Time,UserID,EventTypeID,EventType,Product,Quantity,Unit,Price,……) |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |
A1 Sort table T by time while retrieving data from it.
A2 Generate composite table file T.ctx. create() defines the composite table's structure, where #Time means the Time field acts as the dimension and data is stored in its order. Generally, ordered dimension fields should be written before all the other fields (there are cases where the dimension fields do not take the lead, but for now we assume they head the field list). Unlike a bin file, a composite table's structure must be written explicitly at creation.
A3 Fetch A1's data from the cursor and append it to composite table file T.ctx. The appended data must already be ordered (hence the sorting in A1), as writing to a composite table does not check orderliness.
A4 Close the composite table opened in A2. This is necessary after finishing writing data to a composite table.
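To sanity-check the dump, we can read back a few records from the new file; a small sketch (the fetch size of 10 is arbitrary):

| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor().fetch(10) |

As this only reads data, no close() call is needed (see the note under Sample code section 2).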
Newly-added data: new data is identified by its timestamp. Each day shortly after midnight, we append the data generated during the previous day to the composite table file:
| | A |
|---|---|
| 1 | =connect("demo").cursor@x("select * from T where Time>=? and Time<? order by Time", date(now()-1), date(now())) |
| 2 | =file("T.ctx").open() |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |
A1 Create a cursor that gets the data generated during the previous day through the filtering condition; the data must be sorted by timestamp.
A2 Open the composite table file.
A3 Fetch A1's data from the cursor and append it to composite table file T.ctx. Since the new data is identified by timestamp, it is necessarily later than all existing data, so there is no need to sort-merge new and old data; simply appending to the original file keeps the table ordered.
A4 Close the composite table.
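If the database cannot cheaply return the day's data already ordered by Time, a variant is to sort the cursor on the SPL side with sortx before appending; a sketch under the same connection and file names:

| | A |
|---|---|
| 1 | =connect("demo").cursor@x("select * from T where Time>=? and Time<?", date(now()-1), date(now())).sortx(Time) |
| 2 | =file("T.ctx").open() |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |

sortx performs an external sort, so it works even when the day's data does not fit in memory.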
2. Perform filtering on the composite table, then grouping & aggregation
Suppose we need to summarize the data falling between 2022-03-15 and 2022-06-16:
| | A |
|---|---|
| 1 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =file("T.ctx").open().cursor(UserID,EventTypeID,EventType;Time>=start && Time<=end).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |
A2 Retrieve the data within the specified time range from the composite table file, then group and summarize it. As no data writing is involved here, the close() action on the composite table is not needed.
Since only a few fields participate in the computation in this case, they are listed in the cursor function to avoid retrieving all fields. Note that fields used only in the filtering condition but not in the subsequent grouping & aggregation (such as Time here) do not need to be selected.
The filtering condition should be written inside cursor(). The condition is then evaluated before a result record is generated, so ineligible data is skipped (and the block-skipping described earlier is performed automatically by comparing the condition with the dimension field). By contrast, cursor().select(…) first generates a result set and then filters it, so objects are created even for ineligible data.
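For comparison, the discouraged post-retrieval form would look like this (note that Time must now be listed in cursor() so that select() can reference it):

| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor(UserID,EventTypeID,EventType,Time).select(Time>=start && Time<=end).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |

This form generates a record object for every row read, including the ineligible ones, and cannot take advantage of the minmax-index block skipping.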
Execution result:
| EventTypeID | EventType | Num | iNum |
|---|---|---|---|
| 1 | Login | 4033000 | 393400 |
| 2 | Browse | 3578901 | 348791 |
| 3 | Search | 2947931 | 257539 |
| 4 | AddtoCart | 1845674 | 175476 |
| 5 | Submit | 867345 | 83375 |
| 6 | Logout | 4033000 | 393400 |
3. Use parallel processing to further boost performance
| | A |
|---|---|
| 1 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =file("T.ctx").open().cursor@m(UserID,EventTypeID,EventType;Time>=start && Time<=end;2).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |
A2 uses cursor@m(...;...;n) to implement parallel processing. The @m option generates a multicursor for multithreaded processing, and n is the number of segments the cursor is split into, which generally should not exceed the number of CPU cores. When n is absent, the parallel number in the system configuration is used.
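For instance, to let the configured parallel number decide the degree of parallelism, simply omit n; a sketch:

| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor@m(UserID,EventTypeID,EventType;Time>=start && Time<=end).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |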
Execution result: same as in the previous section.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL