Performance optimization skill: Pre-Joining
I Problem introduction & solving
The JOIN performance in SQL is a long-standing problem, especially when there are a lot of tables to be joined, the performance plummets dramatically.
SQL adopts the HASH partition approach to handle JOINs, which first calculates the HASH values of the join keys and then traverses and compares the records with the same HASH values from two tables. The same computation steps are necessary for each JOIN.
If the data volume is small compared to the size of memory, we can load the data into memory in advance and create the association using the in-memory pointer mechanism. Specifically, this approach calculates HASH values and makes comparisons while loading data, and saves the associations with pointers. The subsequent calculations can directly refer to the associated records, which spares the trouble of calculating and comparing HASH values and improves the performance.
The problem is that SQL doesn’t support the pointer type of data to implement this optimization idea. Even if the data can be wholly loaded into the memory, we still can’t take advantage of the pre-association approach, which is also the same drawback of most SQL-based relational databases. On the contrary, SPL can implement the optimization thanks to its support in pointer data type.
In the following part, we’ll test SQL’s performing differences in achieving two-table join and multi-table join and then do the same calculations with SPL using the pre-association technique to make a comparison.
II Test environment
Eight data tables, 50G data in total (to be small enough to fit into the memory), have been generated according to TPCH standards. The structure of TPCH data table is vastly described online, which will not be elaborated here.
The server for testing has two Intel 2670 CPUs, 2.6G frequency, 16 cores in total, 128G memory and an SSD hard disk.
The lineitem table consists of too large data to be loaded in memory, so we create an orderdetail table of the same structure with appropriate data volume that can be loaded in memory. Hereinafter we will use this table for testing.
The following tests are calculated in single-thread without the help of multi-core to make the difference clearer.
III SQL query test
The Oracle database is used here as a representative for the SQL test, which queries the orderdetail table to get the total revenue of parts order for every year.
1. Two-table join
SQL statements:
select
l_year,
sum(volume) as revenue
from
(
select
extract(year from l_shipdate) as l_year,
(l_extendedprice * (1 - l_discount) ) as volume
from
orderdetail,
part
where
p_partkey = l_partkey
and length(p_type)>2
) shipping
group by
l_year
order by
l_year;
2. Six-table join
SQL statements:
select
l_year,
sum(volume) as revenue
from
(
select
extract(year from l_shipdate) as l_year,
(l_extendedprice * (1 - l_discount) ) as volume
from
supplier,
orderdetail,
orders,
customer,
part,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and length(p_type) > 2
and n1.n_name is not null
and n2.n_name is not null
and s_suppkey > 0
) shipping
group by
l_year
order by
l_year;
3. Test results
Two-table join |
Six-table join |
|
Query time (s) |
26 |
167 |
Both query statements contain nested queries. The auto-optimized Oracle query performs even better than a query without nested queries (The latter may calculate “group by” and “select” repeatedly).
Both results are the best-performing among multiple executions. We found that the first Oracle query execution is always the slowest, which means that the database will cache all the data into the memory (Oracle has a big cache pool) when the memory is sufficient to hold all the data. We select the fastest execution to get the computing time as pure as possible by getting rid of the external storage retrieval time.
By defining the filtering condition as always true, that is, no record is filtered out, we perform the queries over the whole orderdetail table, thus both tests have equal computing amount.
According to the test results, the six-table join is 167/26=6.4 times slower than the two-table join. Besides the external storage retrieval time, the performance degradation originates from the time-consuming table joins and simple comparisons between join key values.
In conclusion, the JOIN performance in SQL is markedly bad.
IV SPL pre-association test
1. Pre-association
SPL script:
A |
|
1 |
>env(region, file(path+"region.ctx").open().memory().keys@i(R_REGIONKEY)) |
2 |
>env(nation, file(path+"nation.ctx").open().memory().keys@i(N_NATIONKEY)) |
3 |
>env(supplier, file(path+"supplier.ctx").open().memory().keys@i(S_SUPPKEY)) |
4 |
>env(customer, file(path+"customer.ctx").open().memory().keys@i(C_CUSTKEY)) |
5 |
>env(part, file(path+"part.ctx").open().memory().keys@i(P_PARTKEY)) |
6 |
>env(orders,file(path+"orders.ctx").open().memory().keys@i(O_ORDERKEY)) |
7 |
>env(orderdetail,file(path+"orderdetail.ctx").open().memory()) |
8 |
>nation.switch(N_REGIONKEY,region) |
9 |
>customer.switch(C_NATIONKEY,nation) |
10 |
>supplier.switch(S_NATIONKEY,nation) |
11 |
>orders.switch(O_CUSTKEY,customer) |
12 |
>orderdetail.switch(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier) |
The first 7 lines read in seven composite tables respectively into memory to generate in-memory tables and set them as global variables. The last 5 lines establish association among the seven tables. The pre-association script will be executed at the start of the SPL server to make preparations for subsequent query.
Below is the structure of pre-associated table objects in memory, taking the orderdetail table for example:
Here only the first pre-associated record of the orderdetail table are shown in the picture, and other records are similar. Limited to the width of the page, only some fields are listed in each table.
2. Two-table join
SPL script:
A |
|
1 |
=orderdetail.select(len(L_PARTKEY.P_TYPE)>2) |
2 |
=A1.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
3. Six-table join
SPL script:
A |
|
1 |
=orderdetail.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_SUPPKEY>0 ) |
2 |
=A1.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
The SPL script is quite simple after the tables are pre-associated. We can directly refer a field of the associated table as the sub-attribute of the referencing attribute, which makes the code easier to understand.
4. Test results
Two-table join |
Six-table join |
|
Query time (s) |
28 |
56 |
The six-table join query is only 2 times slower than the two-table join query due to the added computing time (spent in referring the associated field). Yet we don’t spend time on the join thanks to the pre-association.
V Summary
Test results:
Query time (s) |
Two-table join |
Six-table join |
Performance decline (times) |
SQL |
26 |
167 |
6.4 |
SPL pre-association |
28 |
56 |
2 |
A six-table SQL join is 6.4 times slower than a two-table SQL join, which means that a SQL join consumes a lot of CPU, resulting in obviously bad performance. With the pre-association method, SPL six-table join is only 2 times slower, and the performance degradation is kept in a very limited range.
On the premise that the memory is sufficient enough to load all the data (the application scenario for in-memory databases), pre-association is an effective technique to improve joins query performance when there are a lot of tables involved. SPL can use the method to make a difference while relational databases, including in-memory databases, cannot adopt the optimization skill.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version