From SQL to SPL: Count distinct within intervals in order
The Teradata database table has three fields: date, account, and the name of the product purchased by the account.
Date |
ACCOUNT |
NAME |
2024-01-01 |
A |
XOXO |
2024-01-02 |
A |
XOXO |
2024-01-02 |
A |
OXXO |
2024-01-04 |
A |
XOOX |
2024-01-05 |
A |
OOXO |
2024-01-06 |
A |
XOXO |
2024-01-01 |
B |
B11 |
2024-01-02 |
B |
B21 |
2024-01-02 |
B |
B21 |
2024-01-02 |
B |
B22 |
2024-01-02 |
B |
B11 |
2024-01-03 |
B |
B31 |
2024-01-01 |
C |
C1 |
2024-01-02 |
C |
C1 |
2024-01-03 |
C |
C1 |
Now we need to calculate how many types of products each account has purchased before each day, that is, the number of different products purchased from the beginning to the previous day.
Date |
ACCOUNT |
COUNT_DISTINCT_NAME |
2024-01-01 |
A |
Null |
2024-01-02 |
A |
1 |
2024-01-04 |
A |
2 |
2024-01-05 |
A |
3 |
2024-01-06 |
A |
4 |
2024-01-01 |
B |
Null |
2024-01-02 |
B |
1 |
2024-01-03 |
B |
3 |
2024-01-01 |
C |
null |
2024-01-02 |
C |
1 |
2024-01-03 |
C |
1 |
Group by ACCOUNT, then by date, process each group in turn, get the interval from first group from to the current group, and perform deduplication and counting. But SQL grouping must be accompanied by aggregation, and cannot retain grouped subsets for further processing; At this point, it is necessary to change the mindset, set various markers to transition, and write multi-layer nested statements to repeatedly traverse the data table in order to calculate.
SPL allows grouping to retain grouped subsets for further processing. SPL's cross row references are also much simpler than window functions:
A |
|
1 |
=teraJDBC.query("select * from tb”) |
2 |
=A1.group(ACCOUNT) |
3 |
=A2.conj(~.group(Date;ACCOUNT,~[:-1].conj().icount(NAME):COUNT_DISTINCT_NAME)) |
4 |
=A3.run(if(#3==0,#3=null)) |
A1: Load data through JDBC.
A2: Group by account.
A3: Process each group of A2: Continue grouping by date, process daily data, get data of groups from the beginning to the previous day, union these groups, and calculate the quantity of different product names. Finally, union the processing results of each upper-level group. [: -1] represents the set from the first member to the previous member, and the function icount is used for unique count, i.e. count (distinct)
A4: Change the 0 in column 3 to null.
Question source:https://stackoverflow.com/questions/78344447/get-the-distinct-names-an-account-has-prior-the-last-day-in-teradata-or-oracle-w
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL