"select o_year, sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share .."

chenhu RaqForum 81 No.
1 Reply • 184 View • 3 Months ago

Performance optimization case course: TPCH-Q8

SPL(24)

select
    o_year,
    sum(case
        when nation = 'CHINA' then volume
        else 0
        end) / sum(volume) as mkt_share
from
    (
        Select
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) as volume,
            n2.n_name as nation
        from
            part,
            supplier,
            lineitem,
            orders,
            customer,
            nation n1,
            nation n2,
            region
        where
            p_partkey = l_partkey
            and s_suppkey = l_suppkey
            and l_orderkey = o_orderkey
            and o_custkey = c_custkey
            and c_nationkey = n1.n_nationkey
            and n1.n_regionkey = r_regionkey
            and r_name = 'ASIA'
            and s_nationkey = n2.n_nationkey
            and o_orderdate between date '1995-01-01' and date '1996-12-31'
            and p_type = 'STANDARD POLISHED BRASS'
    ) all_nations
group by
    o_year
order by
    o_year;

This is a grouping and aggregation operation on the filtered result set of multi-table association, which involves a primary-sub table association between orders table and lineitem table.

1. Data storage

Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.

So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx from Q3.

For tables part and supplier, we can directly use part.ctx and supplier.ctx from Q2.

For tables region and nation, we can directly use region.btx and nation.btx from Q2.

Copy these tables to the main directory of this query.

2. General method

	A	B
1	=now()
2	1995-01-01	1996-12-31
3	>nation="CHINA"	>name="ASIA"
4	>type="STANDARD POLISHED BRASS"
5	=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY
6	=file("nation.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY).keys@i(N_NATIONKEY)
7	=A6.select@1(N_NAME==nation).N_NATIONKEY
8	=A6.select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY)
9	=file("supplier.ctx").open().cursor(S_SUPPKEY,S_NATIONKEY;S_NATIONKEY:A6).fetch().keys@im(S_SUPPKEY)
10	=file("part.ctx").open().cursor@m(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY)
11	=file("customer.ctx").open().cursor@m(C_CUSTKEY;A8.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)
12	=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && A11.find(O_CUSTKEY))
13	=file("lineitem.ctx").open().news(A12,L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A10.find(L_PARTKEY),L_SUPPKEY:A9)
14	=A13.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))
15	=A14.groups(year(O_ORDERDATE):o_year;sum(if(L_SUPPKEY.S_NATIONKEY.N_NAME==nation,L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)
16	=A15.new(o_year,s1/s2:mkt_share)
17	=interval@ms(A1,now())

Here we use the techniques mentioned in the previous articles: filter at cursor creation, order-based primary-sub table association, and convert association field to foreign key table pointers.

However, L_SUPPKEY.S_NATIONKEY.N_NAME in A15 is not the filtering of dimension table that appears in JOIN or WHERE clause as mentioned in the previous article, but appears in the conditional sum expression. The implementation method here is that the supplier table is joined with the nation table in A9, the lineitem table is joined with the supplier table in A13, and the foreign key field, as a condition, appears in sum(if) in A15.

Test items	Execution time (seconds)
General method	18

3. The filtering of dimension table in calculation formula

In the previous section, L_SUPPKEY.S_NATIONKEY.N_NAME in A15 is associated with many tables, and the essence of sum(case) is to performconditionalsummationon volume based on whether l_suppkey meets the criteria.

For this calculation requirement, if we filter the supplier first and then use the obtained s_suppkey set as the condition for l_suppkey to do a conditional summation, it can avoid supplier from participating in subsequent association calculation.

Code in SQL:

select
    o_year,
    sum(case
        when l_suppkey in(
			           select s_suppkey
			           from supplier,nation
			           where
                                  s_nationkey=n_nationkey
                                  and n_name='CHINA'
                         )
        then volume
        else 0
        end) / sum(volume) as mkt_share
from
    (
        Select
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) as volume,
            l_suppkey
        from
            part,
            lineitem,
            orders,
            customer,
            nation n1,
            region
        where
            p_partkey = l_partkey
            and s_suppkey = l_suppkey
            and l_orderkey = o_orderkey
            and o_custkey = c_custkey
            and c_nationkey = n1.n_nationkey
            and n1.n_regionkey = r_regionkey
            and r_name = 'ASIA'
            and o_orderdate between date '1995-01-01' and date '1996-12-31'
            and p_type = 'STANDARD POLISHED BRASS'
    ) all_nations
group by
    o_year
order by
    o_year;

After rewriting in SQL, the condition for sum(case) becomes an IN clause, where the subquery is a statement with JOIN and WHERE clauses. In SPL, we can further optimize by filtering dimension table. So, the supplier and its dimension table nation(n2) will not appear in the subsequent association calculation.

Calculation code:

	A	B
1	=now()
2	1995-01-01	1996-12-31
3	>nation="CHINA"	>name="ASIA"
4	>type="STANDARD POLISHED BRASS"
5	=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY
6	=file("nation.btx").import@b()
7	=A6.select@1(N_NAME==nation).N_NATIONKEY
8	=A6.select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY)
9	=file("supplier.ctx").open().cursor@m(S_SUPPKEY;S_NATIONKEY==A7).fetch().keys@im(S_SUPPKEY)
10	=file("part.ctx").open().cursor@m(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY)
11	=file("customer.ctx").open().cursor@m(C_CUSTKEY;A8.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)
12	=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && A11.find(O_CUSTKEY))
13	=file("lineitem.ctx").open().news(A12,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A10.find(L_PARTKEY))
14	=A13.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))
15	=A14.groups(year(O_ORDERDATE):o_year;sum(if(A9.find(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)
16	=A15.new(o_year,s1/s2:mkt_share)
17	=interval@ms(A1,now())

A15 adopts the calculation method mentioned above and uses L_SUPPKEY to search for the primary key S_SUPPKEY of table sequence in A9 and take it as the condition for summation. This way, the supplier in A9 doesn’t need to be joined with nation, and the lineitem in A13 doesn’t need to be associated with supplier.

Test result:

Test items	Execution time (seconds)
General method	18
The filtering of dimension table in calculation formula	16

4. Data conversion

Utilize the conversion methods mentioned in previous articles: convert primary key of dimension table to sequence number, convert date to integer.

For tables region, nation, supplier and part, we can directly use region_2.btx, nation_2.btx, supplier_2.ctx and part_2.ctx converted in Q2.

To convert the enumeration string field type in the part table to number, we can use p_type.btx from Q2.

For the customer table, we can use customer_3.ctx from Q5.

For the orders table, we can use orders_5.ctx from Q3.

Copy these tables to the main directory of this query.

Code for data conversion:

	A
1	=file("part.ctx").open().import(P_PARTKEY).keys@im(P_PARTKEY)
2	=file("lineitem.ctx").open().cursor()
3	=A2.run(L_PARTKEY=A1.pfind(L_PARTKEY))
4	=file("lineitem_4.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)
5	>A4.append(A3)

Calculation code:

	A	B
1	=now()
2	=days@o(date("1995-01-01"))	=days@o(date("1996-12-31"))
3	>nationName="CHINA"	>name="ASIA"
4	=p_type=file("p_type.btx").import@b().(_1)
5	=nation= file("nation_2.btx").import@b()
6	>type=p_type.pos@b("STANDARD POLISHED BRASS")
7	=file("region_2.btx").import@b().pselect(R_NAME==name)
8	=nation.pselect(N_NAME==nationName)	=nation.(N_REGIONKEY==A7)
9	=file("supplier_2.ctx").open()	=A9.cursor@m().skip().(false)
10	=A9.cursor@m(S_SUPPKEY;S_NATIONKEY==A8).fetch().(B9(S_SUPPKEY)=true)
11	=file("part_2.ctx").open()	=A11.cursor@m().skip().(false)
12	=A11.cursor@m(P_PARTKEY;P_TYPE==type).fetch().(B11(P_PARTKEY)=true)
13	=file("customer_3.ctx").open()	=A13.cursor@m().skip().(false)
14	=A13.cursor@m(C_CUSTKEY;B8(C_NATIONKEY)).fetch().(B13(C_CUSTKEY)=true)
15	=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && B13(O_CUSTKEY))
16	=file("lineitem_4.ctx").open().news(A15,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;B11(L_PARTKEY))
17	=A16.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))
18	=A17.groups(year(O_ORDERDATE):o_year;sum(if(B9(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)
19	=A18.new(o_year,s1/s2:mkt_share)
20	=interval@ms(A1,now())

B8, B9, B11 and B13 are aligned sequence.

A18 uses the aligned sequence B9 to determine if L_SUPPKEY meets the sum condition, which achieves better performance than the primary key search mentioned previously.

Test result:

Test items	Execution time (seconds)
General method	18
The filtering of dimension table in calculation formula	16
Data conversion	10

5. Column-wise computing

	A	B
1	=now()
2	=days@o(date("1995-01-01"))	=days@o(date("1996-12-31"))
3	>nationName="CHINA"	>name="ASIA"
4	=p_type=file("p_type.btx").import@b().(_1)
5	=nation= file("nation_2.btx").import@b()
6	>type=p_type.pos@b("STANDARD POLISHED BRASS")
7	=file("region_2.btx").import@b().pselect(R_NAME==name)
8	=nation.pselect(N_NAME==nationName)	=nation.(N_REGIONKEY==A7)
9	=file("supplier_2.ctx").open()	=A9.cursor@m().skip().(false)
10	=A9.cursor@mv(S_SUPPKEY;S_NATIONKEY==A8).fetch().(B9(S_SUPPKEY)=true)
11	=file("part_2.ctx").open()	=A11.cursor@m().skip().(false)
12	=A11.cursor@mv(P_PARTKEY;P_TYPE==type).fetch().(B11(P_PARTKEY)=true)
13	=file("customer_3.ctx").open()	=A13.cursor@m().skip().(false)
14	=A13.cursor@mv(C_CUSTKEY;B8(C_NATIONKEY)).fetch().(B13(C_CUSTKEY)=true)
15	=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE<=B2 && B13(O_CUSTKEY))
	=file("lineitem_4.ctx").open().cursor@v(L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;B11(L_PARTKEY);A15)
16	=A16.joinx@im(L_ORDERKEY,A15:O_ORDERKEY,O_ORDERDATE)
17	=A17.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp)
18	=A18.groups(year(O_ORDERDATE):o_year;sum(if(B9(L_SUPPKEY),dp,0)):s1,sum(dp):s2)
19	=A19.new(o_year,s1/s2:mkt_share)
20	=interval@ms(A1,now())
21	=now()

We do not use news() function to perform the primary-sub table association between orders and lineitem. Instead, we use joinx@im to perform an order-based merge join after cursors are synchronously segmented. Of course, the news() function is applicable, but joinx() function is slightly faster in our testing.

A16 uses derive@o to add a new column dp, so there is no need to compute dp again in A18. For column-wise computing, try to avoid assignment operation. Therefore, instead of using run to assign value to the original column, A16 uses derive to define a new column.

Test result:

Test items	Execution time (seconds)
General method	18
The filtering of dimension table in calculation formula	16
Data conversion	10
Column-wise computing	4

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

SPL(24)

esProc

chenhu • 184 View • 3 Months ago

Performance optimization case course: TPCH-Q8

1. Data storage

2. General method

3. The filtering of dimension table in calculation formula

4. Data conversion

5. Column-wise computing

ToC