From SQL to SPL: Align the existing data to the corresponding position and fill in any missing data with 0
The MySQL database has a sampling table, where each ITEM and CITY is a sampling task. Each sampling task includes 1 to 5 records, indicating that data was collected within 5 weeks. START_Y and START_W are the year and week when sampling started, while FIRST_USE_Y and FIRST_USE_W are the year and week when data was actually collected. The calculation rule for week numbers: Starting from January 1st, every 7 days counts as a week and accumulates sequentially.
ITEM |
CITY |
START_Y |
START_W |
FIRST_USE_Y |
FIRST_USE_W |
VALUE |
A |
NEW YORK |
2023 |
30 |
2023 |
32 |
15000 |
A |
LONDON |
2024 |
2 |
2024 |
2 |
12000 |
A |
LONDON |
2024 |
2 |
2024 |
5 |
50000 |
B |
NEW YORK |
2023 |
49 |
2024 |
1 |
19540 |
B |
MADRID |
2023 |
10 |
2023 |
11 |
15444 |
Now we need to expand each sampling task into 5 records (weeks), increasing sequentially from the year and week of sampling, including the weeks where data was actually collected and those where data was not collected. The former should be aligned to the corresponding position, while the latter has a VALUE of 0.
ITEM |
CITY |
START_Y |
START_W |
FIRST_USE_Y |
FIRST_USE_W |
VALUE |
A |
NEW YORK |
2023 |
30 |
2023 |
30 |
0 |
A |
NEW YORK |
2023 |
30 |
2023 |
31 |
0 |
A |
NEW YORK |
2023 |
30 |
2023 |
32 |
15000 |
A |
NEW YORK |
2023 |
30 |
2023 |
33 |
0 |
A |
NEW YORK |
2023 |
30 |
2023 |
34 |
0 |
A |
LONDON |
2024 |
2 |
2024 |
2 |
12000 |
A |
LONDON |
2024 |
2 |
2024 |
3 |
0 |
A |
LONDON |
2024 |
2 |
2024 |
4 |
0 |
A |
LONDON |
2024 |
2 |
2024 |
5 |
50000 |
A |
LONDON |
2024 |
2 |
2024 |
6 |
0 |
B |
NEW YORK |
2023 |
49 |
2023 |
49 |
0 |
B |
NEW YORK |
2023 |
49 |
2023 |
50 |
0 |
B |
NEW YORK |
2023 |
49 |
2023 |
51 |
0 |
B |
NEW YORK |
2023 |
49 |
2023 |
52 |
0 |
B |
NEW YORK |
2023 |
49 |
2024 |
1 |
19540 |
B |
MADRID |
2023 |
10 |
2023 |
10 |
0 |
B |
MADRID |
2023 |
10 |
2023 |
11 |
15444 |
B |
MADRID |
2023 |
10 |
2023 |
12 |
0 |
B |
MADRID |
2023 |
10 |
2023 |
13 |
0 |
B |
MADRID |
2023 |
10 |
2023 |
14 |
0 |
SQL:
WITH ItemCity As (
SELECT Item, City, MIN( DATEADD(day, Start_W*7, DATEFROMPARTS(Start_Y, 1, 1)) ) As StartWeek
FROM Data
GROUP BY Item, City
),
ItemCityWeeks As (
SELECT Item,City, StartWeek
,Year(StartWeek) As Start_Y,datepart(week, StartWeek)-1 As Start_W
,YEAR(DATEADD(day, Weeks.num*7, StartWeek)) As First_Use_Y
,DATEPART(dayofyear, DATEADD(day, Weeks.num*7, StartWeek))/7 As First_Use_W
FROM ItemCity
CROSS JOIN ( VALUES (0), (1), (2), (3), (4)) Weeks(num)
)
SELECT icw.Item, icw.City
, icw.Start_Y, icw.Start_W, icw.First_Use_Y, icw.First_Use_W
, coalesce(d.value, 0) as Value
FROM ItemCityWeeks icw
LEFT JOIN Data d ON d.Item = icw.Item AND d.City = icw.City
and d.First_Use_Y = icw.First_Use_Y and d.First_Use_W = icw.First_Use_W
ORDER BY Item, City DESC
After SQL grouping, it must aggregate immediately. It cannot keep the grouped subsets and expand each subset into N records, and then simply align the VALUE using filtering methods. It can only solve it by taking a detour: First group and aggregate and then expand, and cross multiplication should be used for expansion. When aligning VALUE, the indirect implementation of multi field join is necessary, and the structure is very complex and the code is also verbose.
SPL code is much simpler and easier to understand:
A |
|
1 |
=mySQL.query("select * from data") |
2 |
=A1.group@u(ITEM,CITY) |
3 |
=A2.news(5;ITEM,CITY,START_Y,START_W, (d=elapse(date(START_Y,1,1),7*START_W+(~-1)*7),year(d)):FIRST_USE_Y, int(ceil(interval(pdate@y(d),d)/ 7)):W, ifn(A2.~.select@1(FIRST_USE_W==W).VALUE,0):VALUE) |
4 |
=A3.rename(W:FIRST_USE_W) |
A1: Load data.
A2: Group by task, and it can retain the grouped subsets without aggregation.
A3: Expand each group of data directly into 5 records, calculate the week number according to the rules, and simply filter out the VALUE corresponding to the current week number. Function pdate@y returns the first day of the year in which the date is located, ifn returns the first non-null member, select@1 filters out the first record that meets the criteria.
Question source:https://stackoverflow.com/questions/78263465/how-to-resample-an-sql-database
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL