"I. Problem introduction & solving In [url], we tested the technique of pre-loading data ta .."

pjf RaqForum 22 No.
1 Reply • 725 View • 5 Years ago

Performance optimization skill: Partial Pre-Association

I. Problem introduction & solving

In Performance optimization skill: Pre-Association, we tested the technique of pre-loading data tables in memory and querying after pre-association. But if the size of dimension table(s) and fact table exceeds the memory, we can load the dimensional table(s) only in memory and index it (them) to achieve at least half of the hash computation, which is called partial pre-association.

We’ll use the lineitem table, which has the largest data volume and can not be loaded in memory, to test this solution by pre-loading the seven relatively small dimension files and leaving lineitem, the fact file, to be loaded in real time during the query.

II. SQL query test

Here Oracle database is still taken as the representative for testing SQL code, which calculates the total revenue of parts orders every year over the lineitem table.

1. Two-table join

The SQL query is as follows:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount) ) as volume
              from
                     lineitem,
                     part
              where
                     p_partkey = l_partkey
                     and length(p_type)>2
       ) shipping
group by
       l_year
order by
       l_year;

2. Six-table join

The SQL query is as follows:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount) ) as volume
              from
                     supplier,
                     lineitem,
                     orders,
                     customer,
                     part,
                     nation n1,
                     nation n2
              where
                     s_suppkey = l_suppkey
                     and p_partkey = l_partkey
                     and o_orderkey = l_orderkey
                     and c_custkey = o_custkey
                     and s_nationkey = n1.n_nationkey
                     and c_nationkey = n2.n_nationkey
                     and length(p_type) > 2
                     and n1.n_name is not null
                     and n2.n_name is not null
                     and s_suppkey > 0
       ) shipping
group by
       l_year
order by
       l_year;

3. Test results

	Two-table join	Six-table join
Query time (s)	235	2669

Both results are the best-performing among multiple executions.

The six-table join is 2669/235=11.4 times slower than the two-table join, indicting the notable performance degradation.

III. SPL partial pre-association test

1. Partial pre-association

SPL script:

	A
1	>env(region, file(path+"region.ctx").open().memory().keys@i(R_REGIONKEY))
2	>env(nation, file(path+"nation.ctx").open().memory().keys@i(N_NATIONKEY))
3	>env(supplier, file(path+"supplier.ctx").open().memory().keys@i(S_SUPPKEY))
4	>env(customer, file(path+"customer.ctx").open().memory().keys@i(C_CUSTKEY))
5	>env(part, file(path+"part.ctx").open().memory().keys@i(P_PARTKEY))
6	>env(orders,file(path+"orders.ctx").open().memory().keys@i(O_ORDERKEY))
7	>nation.switch(N_REGIONKEY,region)
8	>customer.switch(C_NATIONKEY,nation)
9	>supplier.switch(S_NATIONKEY,nation)
10	>orders.switch(O_CUSTKEY,customer)

The first 6 lines read in six dimension tables respectively in memory to generate six in-memory tables, create indexes on them and set them as global variables. The last 4 lines establish association among the six in-memory dimension tables. The pre-association script will be executed at the start of the SPL server to make preparations for subsequent query.

2. Two-table join

SPL script:

	A
1	=file("/home/btx/lineitem.btx").cursor@tb(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)
2	=A1.switch@i(L_PARTKEY,part).select(len(L_PARTKEY.P_TYPE)>2)
3	=A2.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1 - L_DISCOUNT)):revenue)

Temporary loading needs the cursor-style retrieval and association on cursor. The code after that is similar to that for all-in-memory processing.

3. Six-table join

SPL script:

	A
1	=file("/home/btx/lineitem.btx").cursor@tb(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)
2	=A1.switch@i(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)
3	=A2.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME != null && L_SUPPKEY.S_SUPPKEY>0 )
4	=A3.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue)

Similar to the all-in-memory processing, the code after the cursor-style retrieval and association is quite concise and easy to understand.

4. Test results

	Two-table join	Six-table join
Query time (s)	266	472

The six-table join works only 1.8 times slower than the two-table join. The extra time is spent in associating L_ORDERKEY field and L_SUPPKEY field to corresponding dimension tables respectively and performing filtering by specified conditions (i.e. referring fields of these associated tables). There’s no need to handle associations among dimension tables because we already pre-associated them. And the performance also became faster since we cut off half of the hash computation for joins between the fact table lineitem and the dimension tables by creating indexes on them in advance.

VI. Summary

Test results:

Query time (s)	Two-table join	Six-table join	Performance decline (times)
SQL	235	2669	11.4
SPL pre-association	266	472	1.8

It takes SQL 11.4 times slower to execute the six-table join than to execute the two-table join, which shows that SQL consumes a lot of CPU to process JOIN, resulting in considerable decrease in performance. By contrast, SPL six-table join is only 1.8 times slower after pre-association. The latter is able to process even more table joins with a pretty stable performance.

In conclusion, for a query involving a lot of tables joins, if the memory is large enough to hold all the dimension tables, the partial pre-association in SPL can improve the performance effectively. Unfortunately, for relational databases, the database engine can’t optimize itself when many table joins are required, thus leading to inevitable, serious performance degradation.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

pjf • 725 View • 5 Years ago

Performance optimization skill: Partial Pre-Association

I. Problem introduction & solving

II. SQL query test

1. Two-table join

2. Six-table join

3. Test results

III. SPL partial pre-association test

1. Partial pre-association

2. Two-table join

3. Six-table join

4. Test results

VI. Summary

ToC