SPL optimizes funnel analysis of e-commerce from more than 3 minutes to 10 seconds
Problem description
In the e-commerce company A, conversion funnel analysis is a common and important statistical requirement.
When users use their smart devices to shop online, the system will establish a connection to form a session. Each session contains a number of events such as website visiting, product page viewing and order placing. The events of each user occur in a certain order. The later the event occurs, the fewer the number of users involved in the event, just like a funnel. Conversion funnel analysis is to count the number of distinct users of each event first, and then perform further calculations like conversion rate based on the counting result.
Problem analysis
It is very complicated to code in SQL to perform funnel analysis. Even if the code is finally written, we have to use a subquery in each step and join repeatedly, resulting in the difficulty to understand code and inefficient execution.
The table structure of the e-commerce company A is desensitized and simplified into a table T, which includes the fields: event number (id), user number (gid), event time (etime) and event type (eventtype).
After making slight modifications to the SQL statement for three-step funnel analysis, and writing it in Oracle syntax, the result is roughly as follows:
with e1 as (
select gid,1 as step1,min(etime) as t1
from T
where etime>= to_date('2021-01-10', 'yyyy-MM-dd') and etime<to_date('2021-01-25', 'yyyy-MM-dd') and eventtype='eventtype1' and …
group by 1
),
e2 as (
select gid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
from T as e2
inner join e1 on e2.gid = e1.gid
where e2.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e2.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e2.etime > t1 and e2.etime < t1 + 7 and eventtype='eventtype2' and …
group by 1
),
e3 as (
select gid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
from T as e3
inner join e2 on e3.gid = e2.gid
where e3.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e3.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e3.etime > t2 and e3.etime < t1 + 7 and eventtype='eventtype3' and …
group by 1
)
Select
sum(step1) as step1,
sum(step2) as step2,
sum(step3) as step3
from
e1
left join e2 on e1.gid = e2.gid
left join e3 on e2.gid = e3.gid
Sub-query e1 firstly filters the data whose event type is eventtype1 in a specified period of time form table T. The starting and ending time of the specified period usually are the given parameters, which change dynamically. “and…” means to have the possibility to satisfy other conditions, and sub-queries e2 and e3 also need to satisfy the same condition. Then the data are grouped by gid, and the smallest etime of each group are retrieved as the time t1 of the first step of conversion funnel, and the value of new field step1 are defined as 1.
Sub-query e2 uses table T (also known as e2) to inner join with e1, and the join field is gid. It filters the data whose e2.etime is less than 7 days after t1 and event type is eventtype2 in the specified period. The 7-day here is called the funnel window period, which can also be the given parameter. Then the data are group by gid, and the smallest e2.etime of each group are retrieved as the time t2 of the second step of conversion funnel, and the value of new field step2 are defined as 1.
Sub-query e3 uses T table (also known as e3) to inner join with e2, and the join field is gid. It filters the data whose e3.etime is larger than t2 as well as less than 7 days after t1 (funnel window period), and event type is eventtype3 in the specified period. Then the data are grouped by gid, the smallest e3.etime of each group are retrieved as the time t3 of the third step of conversion funnel, and the value of new field step3 are defined as 1.
At last, the database uses e1 to left join with e2 and e3, the join field is gid, and aggregates step1, step2, and step3 to get a sum.
Here are the 3 steps of funnel conversion analysis. If you want to accomplish more steps of funnel conversion analysis, please refer to e3 to write sub-queries like e4, e5, and so on, and main query needs to be added with the corresponding main code of join and aggregation.
Generally speaking, the result set of grouping table T of a specific period by gid tends to be big. To perform big group requires buffering on external storage, which results in bad computation performance.
In the actual environment of e-commerce company A, table T involves more than 300 million pieces of data every month. Executing this SQL statement on Snowflake’s Medium cluster (4 nodes) didn’t get a result in three minutes.
Solution
1. Presorting and order-based algorithm
We need to presort the original data orderly according to the grouping field in the process of data preparation. When performing conversion funnel analysis, the algorithm traverses the ordered data which satisfy the conditions like the time period and the event type, reads the data of each group into memory in turn. Specifically, step 1: it sorts the current grouped data in time order; step 2: it finds the first record of the first event type in the current group, records the time as t1, and assigns t1 to the first member of the current grouping result set. If t1 cannot be found, then the current group is skipped; step 3: in the records of the second event type in the current group, it finds the first record whose time is after t1 and before t1+7 (the funnel window period), and assigns its time t2 to the second member of the result set (if t2 cannot be found, then the assignment will be null); step 4: in the records of the third event type in the current group, it finds the first record whose time is after t2 and before t2+7, and assigns its time t3 to the third member of the result set (if t2 is null or t3 cannot be found, then the assignment will be null); step 5: it aggregates the three members of the result set per group to get the final result. This method traverses the data only once and avoids performing grouping with a big result, thus improving the performance greatly.
We can also sort the original data by both grouping field and time field beforehand and spare the resorting of each group, which improves the performance less effectively, but simplifies the code obviously.
In fact, many grouping fields of in-group time-series calculation are certain (like the user ID and the account number) rather than randomly chosen. Sorting data by these certain fields can help many situations of in-group time-series calculation. Other in-group time-series calculations only differ in the specific algorithms, which will be introduced on their own.
Presorting is slow yet one-off and holds only one kind of storage without redundant data.
SQL, based on the unordered set, cannot ensure the data of each group are orderly stored, therefore, it cannot directly use the order-based algorithm.
2. New-added data
Newly-added data are not always ordered by the grouping field, so they should not be simply appended to the end of the already ordered data. And it will be very time-consuming to directly do an ordinary full sorting of the existing data and the new data together.
Instead, we can sort the newly-added data first and then, along with the already ordered data, use low-cost order-based MERGE algorithm to generate a new ordered data table. In this way, the overall degree of complexity equals to reading and writing the whole data once, which avoids the frequent temporary external buffering in ordinary full sorting, thus improving the performance greatly.
Furthermore, we can keep a small-scale ordered table additionally (hereinafter referred to as patch data), which will merge with the newly-added data while keeping the old ordered data unchanged. The patch data will merge with the old ordered data until they accumulate to a suitable size after a while. When performing the time-series calculation in the group, the algorithm reads from the old ordered data and patch data respectively, merges them and then calculates. In this manner, the performance will be a bit slower compared with handling one table of ordered data, but the data orderliness can still make the computation much faster.
The time when to merge the patch data with the old ordered data is related to the cycles when data should be added. If new data are added every day, we can do the merge once a month. The patch data store data of one month at most and the existing ordered data contain all data one month ago. That is, the patch data may be far smaller than the old ordered data, so the data to be merged every day is relatively small and the data appending is fast. As we do the whole-data order-based merge only once a month, it is fairly acceptable that it takes a littler longer to complete.
Technology selection
SQL, which is commonly used in relational databases and big data technologies, is based on unordered set theory, and cannot ensure the data of each group are orderly stored, therefore, it cannot directly use the above-mentioned group- and time-based ordered algorithm.
If we use high-level languages like Java to implement such algorithm, the code will be huge in amount and difficult to be universal.
esProc SPL, a professional data calculation engine, supports physically ordered storage of data, making it easy to implement the above-mentioned algorithm. Moreover, SPL encapsulates a large number of data computing functions and can implement funnel analysis with very simple code.
Actual effect
When performing a 3-step funnel analysis of 14-day span on GCP’s 16C128G virtual machine, the result is obtained in 10 seconds, which meets and exceeds customer expectations.
In terms of development difficulty, SPL has made a lot of encapsulations, provided rich functions and built in basic algorithms required by the above scheme. The SPL code for 3-step funnel analysis is roughly as follows:
A | B | |
1 | =["eventtype1","eventtype2","eventtype3"] | =file("T.ctx").open() |
2 | =B1.cursor(gid,etime,eventtype;etime>=date("2021-01-10") && etime<date("2024-01-25") && A1.contain(eventtype) && …) | |
3 | =A2.group(gid).(~.sort(etime)) | =A3.new(~.select@1(eventtype==A1(1)):first,~:all).select(first) |
4 | =B3.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(eventtype==A1.~ && etime>t && etime<t1+7).etime, null)))) | |
5 | =A1.("count(~("/#/")):STEP"/#).concat@c() | |
6 | =A4.groups(;${A5}) |
This code is shorter and more flexible than the previous SQL code and can stay the same even if adding a few more steps of funnel analysis.
A1: define the three event types, which can also be given parameters.
B1: open the composite table T.ctx.
A2: create a cursor to filter the data satisfying the conditions like time period and event type. These conditions can be given parameters.
A3: define the grouping operation, and sort each group by etime.
B3: create a new cursor, name the first data of the first event type in each group as “first” field and the original group as “all” field. Then define filtering to skip the data whose “first” field is null.
A4: define the calculation of B3. Loop through each record according to A1. The first loop assigns the time of “first”, which is the earliest time of the first event type, to variables t1, t and the first member of the result set at the same time. The second loop finds the first record whose eventtype is the second event type in “all” and etime is larger than t and smaller than t1+7. Its etime is assigned to t and the second member of the result set (if t cannot be found, the assignment will be null). The third loop finds the first record whose eventtype is the third event type in “all” and etime is larger than t and smaller than t1+7 if t is not null. Its etime is assigned to t and the third member of the result set (if t is null or cannot be found, the assignment will be null). Please note that the 7 days (funnel window period) here can also be given parameter.
A5: based on the number of funnel analysis steps of A1, dynamically generate the count statement: count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3.
A6: execute the previously defined calculation and aggregate the three members of each result set in each group to return a small result set and count the number. The count code generated by A5 is utilized, and the actual executed code is:
=A4.groups(; count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)
Here are still 3 steps of conversion funnel analysis. To accomplish more steps, you can add the event types of following steps in the event type sequence of A1, for example, [“eventtype1”,“eventtype2”,“eventtype3”,“eventtype4”,“eventtype5”…]. The code can stay the same if the event type sequence is the given parameter. This method is much simpler compared with adding sub-queries and modifying the main query in SQL.
Postscript
Funnel analysis problems involve large numbers of users, and the event table is very large and needs to be stored externally in general. However, the data amount of each user is not large, ranging from a few to several thousand pieces of data.
Moreover, although the calculation of events is complex, it is done on a user-by-user basis and, there is no relationship between the events of different users, and the calculation of one user’s events generally doesn’t involve the events of other users.
We can utilize the ordered storage mechanism of SPL to separately load each user’s data into memory for computation, which can reduce the computing complexity and effectively improve performance.
In doing so, conversion funnel analysis can be done by traversing the data only once, which is fast in calculation and takes up very little memory space. Moreover, such computing logic is perfectly in line with our natural thinking and can reduce the coding difficulty.
SQL, which is commonly used in relational databases and big data technologies, is based on unordered set theory, and cannot ensure the data of each user are orderly stored, therefore, it cannot directly use the above-mentioned user- and time-based ordered algorithm. If we use high-level languages like Java to implement such algorithm, the code will be huge in amount and difficult to be universal.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version