How to speed up funnel analysis of e-commerce system
In the e-commerce system, the conversion funnel analysis is a very important data analysis calculation. The user of e-commerce platform will conduct multiple operations (events), including page viewing, searching, adding to cart, placing an order and paying, etc. These events occur in a certain order, and the later the event occurs, the fewer the number of users involved in the event, just like a funnel. Usually, the conversion funnel analysis is to count the number of users of each event first, and then do further calculations based on counting result, such as calculating the conversion rate. Since such analysis involves huge data, and the calculation is very complex, it often leads to performance problem.
To make funnel analysis run faster, we first need to understand the characteristics of calculation and data: the number of users of e-commerce system is very large, and hence an extremely large event table will be generated after the events of every user are recorded, and this table needs to be stored in external storage in general; In funnel analysis, the total amount of data involved in calculation is large and may reach tens of millions or even up to 100 million rows, yet the data amount of each user is not large, only a few to thousands of rows.
Moreover, although the calculation of events in funnel analysis is complex, it is done on a user-by-user manner and, there is no relationship between the events of different users, and the calculation of one user’s events generally doesn’t involve the events of other users. Therefore, we’d better load each user's data into memory separately for computation, which can reduce the computing complexity and effectively improve performance.
For one user, the calculation of multiple events is performed in chronological order, it is a typical time-sequence calculation. In some cases, the time-sequence calculation logic on events is extremely complex and requires writing multi-step code to implement, and thus it is more necessary to process on a user-by-user manner.
In order to be able to read the data of every user in sequence, it needs to pre-store the data of event table orderly by user on hard disk. When analyzing the data with conversion funnel, first read all data of one user into memory, and then sort the data by time, finally do the time-sequence calculation on events. For example, find the page viewing event first, and then see if it follows by an adding-to-cart event. If there is adding-to-cart event, and the time span between two events is less than 7 days, then it indicates this user has conducted two steps. After the data of this user is calculated, read the data of other users one by one for calculation until all data is fetched out.
In doing so, the conversion funnel analysis can be done during only one traversal of the data, and the calculation speed is fast and the memory space is less occupied. Moreover, such computing logic is perfectly in line with our natural thinking and can reduce the coding difficulty.
Although it is slow to pre-sort the event table, it only needs to sort once, and storing only one event table orderly by user will be enough, without redundancy.
SQL, which is commonly used in relational databases and big data technologies, is based on unordered set theory, and it cannot ensure the continuous storage of data of each user, and hence the above-mentioned user- and time-ordered algorithms cannot be adopted directly. If the high-level languages such as Java is employed to implement these algorithms, the amount of code will be huge and there are limitations in using such languages.
esProc SPL, a professional data calculation engine, supports storing the data orderly in physics, and it is easy for SPL to implement these algorithms. Moreover, SPL encapsulates a lot of data computing functions, and can implement funnel analysis with very simple code. Assuming that the event table T contains the following fields: event number (id), user number (uid), event time (etime) and event type (eventtype), SPL code for calculating three-step funnel is roughly as follows:
A |
B |
|
1 |
=["PageView","ProductDetail","Confirm"] |
=file("T.ctx").open() |
2 |
=B1.cursor(uid,etime,eventtype;etime>=date("2022-10-10") && etime<date("2022-10-25") && A1.contain(eventtype) && …) |
|
3 |
=A2.group(uid).(~.sort(etime)) |
|
4 |
=A3.new(~.select@1(eventtype==A1(1)):first,~:all).select(first) |
|
5 |
=A4.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(eventtype==A1.~ && etime>t && etime<t1+7).etime, null)))) |
|
6 |
=to(A1.len()).("count(~("/~/")):STEP"/~).concat@c() |
=A5.groups(;${A6}) |
A1: list three event types, you can also pass them as parameter. B1: open the composite table T.ctx.
A2: create a cursor to filter out the data that meets the conditions such as time span and event type. All the filtering conditions can be passed in as parameter.
A3 - B5: loop through each user to perform time-sequence calculation according to the method mentioned above, and finally aggregate the number of users who conduct three steps.
This is a three-step funnel conversion analysis. If you want to implement calculation with more steps, just add the event type of subsequent steps to the event type sequence of A1, for example: [ "SiteView","PageView","Login","Product Detail"and"Confirm"]. If the event type is passed to the event type sequence through parameter, the code can be left unchanged.
For the funnel analysis calculation, it usually needs to filter on time criteria first. Although the time span involving all data is long, the time span corresponding to the filtered data is relatively short. Since we have stored the entire event table orderly by user, we cannot quickly filter on time criteria. To tackle this problem, SPL provides a bi-dimension ordering structure, which allows entire event table to be ordered by time (implementing fast filter), and also allows the event table to be ordered by user when accessing the table (enabling you to easily take out user data one by one for subsequent calculations). With this structure, it is equivalent to implementing an event table that is ordered simultaneously by both time and user fields, thereby obviously speeding up funnel analysis. For details, visit: Speed up Big Data User Behavior Analysis using Bi-dimension Ordering Structure.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL