Multistep Complex SQL Queries Optimization Example
Key words: SQL Group By, aggregate, multistep, multi-table, complex
Question
Source data: In the main table deliver, one customer corresponds to more than one delivery project:
deliverItem is the subtable, where a delivery project includes multiple delivery items which can be in different status (1 – Successfully delivered, 2 – Not delivered):
Requirement: For each customer, count the successfully delivered items and the not delivered items.
Answer
The intuitive solution:
1. Over the deliverItem table, count the not delivered items for each delivery project (notDelivered);
2. Join Step1’s result set with deliver table to get a new result set [customer,notDelivered];
3. Group records in Step 2’s result set by customer and, for each group count records where notDelivered is 0 and those where notDelivered is greater than 0.
SQL query:
Select r1.customer,r1.complete,r2.notComplete
from
(select customer, count(*) as complete
from
(select d.customer,d2.notDelivered
from deliver d
left join
(select deliverID,count(*) as notDelivered
from deliverItem
where delivered==1
group by deliverID) d2
on d.deliverID=d2.deliverID)
where notDelivered is null
group by customer
) r1
join
(select customer,count(*) as notComplete
from
(select d.customer, d2.notDelivered
from deliver d
left join
(select deliverID,count(*) as notDelivered
from deliverItem
where delivered==1
group by deliverID) d2
on d.deliverID=d2.deliverID)
where notDelivered<>null
group by customer
) r2
on r1.customer=r2.customer
I encountered difficulties trying to express the algorithm in SQL in an intuitive way. Yet I finally found SQL workarounds for them to work out the above query. Another programmer may make their own way through the SQL jungle. So there may be various workarounds and techniques depending on the optimization solution a programmers uses. But whatever they are they all conform to the following steps:
1. Get all records containing not delivered items from deliverItem table, group them by deliverID and count notDelivered items for each group;
2. Left join deliver table and Step 1’s result set to get a new result set [customer,notDelivered];
3. Group records in the joining result by customer and count records where notDelivered is null (complete) to get the result set [customer,complete];
4. Re-perform Step 1;
5. Re-perform Step 2;
6. Here’s similar to Step 3. Group records in the joining result by customer and count records where notDelivered is greater than 0 (notCcomplete) to get the result set [customer,notComplete];
7. Join Step 3’s result set and Step 6’s result set to get the desirable result set [customer,complete,notComplete].
Let’s look at how SPL express the algorithm (SPL script):
A |
|
1 |
=connect("mysqlDB") |
2 |
=A1.query(“select * from deliver”) |
3 |
=A1.query(“select * from deliverItem”) |
4 |
=A3.group(deliverID;~.select(delivered==1).len():notDelivered) |
5 |
=A2.switch(deliverID,A4:deliverID) |
6 |
=A5.group(customer; ~.select(deliverID.notDelivered>0).len():notComplete, ~.select(deliverID.notDelivered==0).len():complete) |
7 |
=A1.close() |
A1: Connect to the database;
A2/A3: Load data respectively from both tables (esProc also has functions to conveniently retrieve data from an Excel file or a CSV file);
A4/A5/A6 do the real query work in an almost intuitive way.
A4: Group deliverItem table by deliverID and count not delivered items (notDelivered) in each group (this includes the group where notDelivered is 0);
A5: Join A4’s result set with the deliver table, that is, replace deliver’s deliverID values by their referenced records in A4 using switch() function. SQL doesn’t have a counterpart to express this kind of nested action (The following pic shows the detailed process of the multilevel reference). So it takes a very roundabout route to handle it.
A6: Group A5’s result set by customer to count in each group the delivered items is greater than 0 (notComplete) and the successfully delivered items, i.e. notDelivered =0, (complete).
Summary
The so-called brilliant SQL workarounds are in effect inefficient, unreadable and difficult to debug due to their lack of universality. After all, real-world businesses need simple and convenient solutions instead of skill show-off.
Based on innovative mathematic theory, the discrete data set, esProc SPL (Structured Process Language) abandons the ancient relational algebra, on which SQL is based, to embrace order-based concept, complete set-orientation and stepwise coding. It thus enables programmers to describe an algorithm in an intuitive, simple, smooth and efficient way, which helps reduce development and maintenance costs. It also increases performance because it makes high-performing algorithms achievable.
Read more SPL use cases and SQL optimization tricks in https://c.scudata.com/tag/esProc.
Want to have a try?
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL