Group Records by a Date Sequence & Find Difference
【Question】
I have two tables: 1)users(id,registerdate) 2)user_answer(userid,answer,updated_date)
I want the count of zero usage per day. How many users are registering but not answering per day. Results will be like this:
Date registedCount notAnsweredCount
15-09-02 20 10
15-09-01 20 10
15-08-31 12 4
Data is like this for user table((1,‘15-09-01’),(2,‘15-09-01’),(3,‘15-09-01’)) and this for user answer table ((1,0,15-09-01)).. Here you can see three users are registered on the day of Sep. 01, 2015 but only one user answered one question. So, result will be (Date=>15-09-01, registedCount => 3, notAnsweredCount => 2)
Here’s an almost-correct solution:
SELECT date_range.aDay,
COUNT(DISTINCT users.id) AS registedCount,
SUM(IF(users.id IS NOT NULL AND user_answer.userid IS NULL, 1, 0)) AS notAnsweredCount
FROM
(
SELECT DATE_ADD('2015-09-01', INTERVAL units.aCnt + tens.aCnt * 10 DAY) AS aDay
FROM
(
SELECT 0 AS aCnt UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) units
CROSS JOIN
(
SELECT 0 AS aCnt UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) tens
) date_range
LEFT OUTER JOIN users
ON date_range.aDay = users.registerdate
LEFT OUTER JOIN user_answer
ON users.id = user_answer.userid
GROUP BY date_range.aDay
【Answer】
There are two key computations to get done: 1)group the source database tables by a specified date sequence, instead of one of its own fields; 2) find the difference between registered ids and userids that answer questions per day. The code is difficult to understand if we express them in SQL. Here I choose to phrase them in SPL. The Structured Process Language supports step-by-step coding, which is intuitive to understand:
Parameter settings:
SPL script:
A |
|
1 |
$select id,registerdate from users where registerdate>=? And registerdate<=?; argBegin,argEnd |
2 |
$select userid,updated_date from user_answer where updated_date>=? And updated_date<=?; argBegin,argEnd |
3 |
=periods(argBegin,argEnd) |
4 |
=A1.align@a(A3,registerdate).(~.(id)) |
5 |
=A2.align@a(A3,updated_date).(~.(userid)) |
6 |
=A3.new(~:Date,A4(#).len():registedCount,(A4(#)\A5(#)).len():notAnsweredCount) |
A1, A2: Retrieve data from the two source tables respectively with a SQL-like statement.
A3: periods() function generates a sequence of dates according to two parameters.
A4, A5: align() function groups A1 and A2 respectively by A3’s date sequence.
A6: Create a new table sequence consisting of the desired columns. The sign “\” means getting difference.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL