Performance optimization skill: Partial Pre-Association
I. Problem introduction & solving
In Performance optimization skill: Pre-Association, we tested the technique of pre-loading data tables in memory and querying after pre-association. But if the size of dimension table(s) and fact table exceeds the memory, we can load the dimensional table(s) only in memory and index it (them) to achieve at least half of the hash computation, which is called partial pre-association.
We’ll use the lineitem table, which has the largest data volume and can not be loaded in memory, to test this solution by pre-loading the seven relatively small dimension files and leaving lineitem, the fact file, to be loaded in real time during the query.
II. SQL query test
Here Oracle database is still taken as the representative for testing SQL code, which calculates the total revenue of parts orders every year over the lineitem table.
1. Two-table join
The SQL query is as follows:
select
l_year,
sum(volume) as revenue
from
(
select
extract(year from l_shipdate) as l_year,
(l_extendedprice * (1 - l_discount) ) as volume
from
lineitem,
part
where
p_partkey = l_partkey
and length(p_type)>2
) shipping
group by
l_year
order by
l_year;
2. Six-table join
The SQL query is as follows:
select
l_year,
sum(volume) as revenue
from
(
select
extract(year from l_shipdate) as l_year,
(l_extendedprice * (1 - l_discount) ) as volume
from
supplier,
lineitem,
orders,
customer,
part,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and length(p_type) > 2
and n1.n_name is not null
and n2.n_name is not null
and s_suppkey > 0
) shipping
group by
l_year
order by
l_year;
3. Test results
Two-table join |
Six-table join |
|
Query time (s) |
235 |
2669 |
Both results are the best-performing among multiple executions.
The six-table join is 2669/235=11.4 times slower than the two-table join, indicting the notable performance degradation.
III. SPL partial pre-association test
1. Partial pre-association
SPL script:
A |
|
1 |
>env(region, file(path+"region.ctx").open().memory().keys@i(R_REGIONKEY)) |
2 |
>env(nation, file(path+"nation.ctx").open().memory().keys@i(N_NATIONKEY)) |
3 |
>env(supplier, file(path+"supplier.ctx").open().memory().keys@i(S_SUPPKEY)) |
4 |
>env(customer, file(path+"customer.ctx").open().memory().keys@i(C_CUSTKEY)) |
5 |
>env(part, file(path+"part.ctx").open().memory().keys@i(P_PARTKEY)) |
6 |
>env(orders,file(path+"orders.ctx").open().memory().keys@i(O_ORDERKEY)) |
7 |
>nation.switch(N_REGIONKEY,region) |
8 |
>customer.switch(C_NATIONKEY,nation) |
9 |
>supplier.switch(S_NATIONKEY,nation) |
10 |
>orders.switch(O_CUSTKEY,customer) |
The first 6 lines read in six dimension tables respectively in memory to generate six in-memory tables, create indexes on them and set them as global variables. The last 4 lines establish association among the six in-memory dimension tables. The pre-association script will be executed at the start of the SPL server to make preparations for subsequent query.
2. Two-table join
SPL script:
A |
|
1 |
=file("/home/btx/lineitem.btx").cursor@tb(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE) |
2 |
=A1.switch@i(L_PARTKEY,part).select(len(L_PARTKEY.P_TYPE)>2) |
3 |
=A2.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1 - L_DISCOUNT)):revenue) |
Temporary loading needs the cursor-style retrieval and association on cursor. The code after that is similar to that for all-in-memory processing.
3. Six-table join
SPL script:
A |
|
1 |
=file("/home/btx/lineitem.btx").cursor@tb(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE) |
2 |
=A1.switch@i(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier) |
3 |
=A2.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME != null && L_SUPPKEY.S_SUPPKEY>0 ) |
4 |
=A3.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue) |
Similar to the all-in-memory processing, the code after the cursor-style retrieval and association is quite concise and easy to understand.
4. Test results
Two-table join |
Six-table join |
|
Query time (s) |
266 |
472 |
The six-table join works only 1.8 times slower than the two-table join. The extra time is spent in associating L_ORDERKEY field and L_SUPPKEY field to corresponding dimension tables respectively and performing filtering by specified conditions (i.e. referring fields of these associated tables). There’s no need to handle associations among dimension tables because we already pre-associated them. And the performance also became faster since we cut off half of the hash computation for joins between the fact table lineitem and the dimension tables by creating indexes on them in advance.
VI. Summary
Test results:
Query time (s) |
Two-table join |
Six-table join |
Performance decline (times) |
SQL |
235 |
2669 |
11.4 |
SPL pre-association |
266 |
472 |
1.8 |
It takes SQL 11.4 times slower to execute the six-table join than to execute the two-table join, which shows that SQL consumes a lot of CPU to process JOIN, resulting in considerable decrease in performance. By contrast, SPL six-table join is only 1.8 times slower after pre-association. The latter is able to process even more table joins with a pretty stable performance.
In conclusion, for a query involving a lot of tables joins, if the memory is large enough to hold all the dimension tables, the partial pre-association in SPL can improve the performance effectively. Unfortunately, for relational databases, the database engine can’t optimize itself when many table joins are required, thus leading to inevitable, serious performance degradation.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL