"select p_brand,p_type,p_size, count(distinct ps_suppkey) as supplier_cnt from partsupp,part wher .."

chenhu RaqForum 81 No.
1 Reply • 140 View • 2 Months ago

Performance optimization case course: TPCH-Q16

SPL(24)

select
    p_brand,p_type,p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#21'
    and p_type not like 'SMALL%'
    and p_size in (2, 15, 17, 23, 25, 41, 44, 45)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size;

This is a grouping & aggregation operation on the filtered result set of multi-table association.

1. Data storage

There is no special requirement for tables partsupp, part and supplier, store them in order by primary key.

Continue to use partsupp.ctx, part.ctx and supplier.ctx from Q2 query.

Copy these tables to the main directory of this query.

2. General method

The query is to match partsupp with the two dimension tables part and supplier, filter and then group, where the grouping operation is based on two dimension table fields.

Here we can use the optimization methods mentioned in previous articles. First, we filter the dimension tables, and then perform foreign key attributization on the primary table. The records that do not matchwiththe foreign key table will be filtered out during the process, and we can directly reference the dimension table fields during the grouping operation.

	A	B
1	=now()
2	="Brand#21"	="SMALL"
3	=[2,15,17,23,25,41,44,45]
4	=file("part.ctx").open().cursor@m(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=A2 && !pos@h(P_TYPE,B2) && A3.contain@b(P_SIZE)).fetch().keys@im(P_PARTKEY)
5	=file("supplier.ctx").open().cursor@m(S_SUPPKEY;!like(S_COMMENT,"CustomerComplaints*")).fetch().keys@im(S_SUPPKEY)
6	=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:A4,PS_SUPPKEY:A5)
7	=A6.groups@u(PS_PARTKEY.P_BRAND,PS_PARTKEY.P_TYPE,PS_PARTKEY.P_SIZE;icount(PS_SUPPKEY):supplier_cnt)
8	=A7.sort@m(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)
9	=interval@ms(A1,now())

A4 and A5 retrieve and filter the dimension tables respectively. A6 creates cursor while performing foreign key join and filtering.

Note that A4 uses contain@b when performing the IN judgement, which means the binary search will be adopted. When there are a lot of members in the set on which the IN judgment works, sorting members and using binary search can reduce the number of comparisons and increase performance.

Test result:

Test items	Execution time (seconds)
General method	18

3. Data conversion

For this query, we need to use the dimension table primary key sequence-numberization method. The partsupp_2.ctx and supplier_2.ctx from Q2 query can be used directly. Copy them to the main directory of this query.

In addition, we also need to convert the enumeration string field p_brand to numbers based on part_2.ctx of Q2 query.

Code for data conversion:

	A
1	=file("part_2.ctx").open().cursor().fetch()
2	=A1.id(P_BRAND).sort()
3	=file("p_brand.btx").export@b(A2)
4	=A1.run(P_PARTKEY=#, P_BRAND=A2.pos@b(P_BRAND))
5	=file("part_16_3.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)
6	>A5.append(A4.cursor())

Calculation code:

	A	B
1	=now()
2	Brand#21	SMALL
3	[2,15,17,23,25,41,44,45]
4	=file("p_brand.btx").import@b().(_1)	=A4.pos@b(A2)
5	=file("p_type.btx").import@b().( !pos@h(_1,B2))
6	=file("part_16_3.ctx").open()	=A6.cursor@m().skip().(null)
7	=A6.cursor@m(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=B4 && A3.contain@b(P_SIZE) && A5(P_TYPE)).fetch().(B6(P_PARTKEY)=~)
8	=file("supplier_2.ctx").open()	=A8.cursor@m().skip().(false)
9	=A8.cursor@m(S_SUPPKEY;!like(S_COMMENT,"CustomerComplaints*")).fetch().(B8(S_SUPPKEY)=true)
10	=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:B6:#,B8(PS_SUPPKEY))
11	=A4.len()+1
12	=A10.groups@u(PS_PARTKEY.P_TYPEA1146+PS_PARTKEY.P_SIZE*A11+PS_PARTKEY.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt)
13	=A12.new@m(A4(gk%A11):P_BRAND,A5(gk\(A1146)):P_TYPE,gk%(A1146)\A11:P_SIZE,supplier_cnt)
14	=A13.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)
15	=interval@ms(A1,now())

A5, B6 and B8 use the alignment sequence technique explained in previous articles. A12 uses the grouping key technique to convert the three-field based grouping operation to the grouping operation by a single filed gk. A13 then computes the original three grouping fields using gk.

Test result:

Test items	Execution time (seconds)
General method	18
Data conversion	14

4. Column-wise computing

Calculation code:

	A	B
1	=now()
2	Brand#21	SMALL
3	[2,15,17,23,25,41,44,45]
4	=file("p_brand.btx").import@b().(_1)	=A4.pos@b(A2)
5	=file("p_type.btx").import@b().( !pos@h(_1,B2))
6	=file("part_16_3.ctx").open()	=A6.cursor@m().skip().(null)
7	=A6.cursor@mv(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=B4 && A3.contain@b(P_SIZE) && A5(P_TYPE)).fetch().(B6(P_PARTKEY)=~)
8	=file("supplier_2.ctx").open()	=A8.cursor@m().skip().(false)
9	=A8.cursor@mv(S_SUPPKEY;!like(S_COMMENT,"CustomerComplaints*")).fetch().(B8(S_SUPPKEY)=true)
10	=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:B6:#,B8(PS_SUPPKEY))
11	=A4.len()+1
12	=A10.groups@u(PS_PARTKEY.P_TYPEA1146+PS_PARTKEY.P_SIZE*A11+PS_PARTKEY.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt)
13	=A12.new@m(A4(gk%A11):P_BRAND,A5(gk\(A1146)):P_TYPE,gk%(A1146)\A11:P_SIZE,supplier_cnt)
14	=A13.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)
15	=interval@ms(A1,now())

Test result:

Test items	Execution time (seconds)
General method	18
Data conversion	14
Column-wise computing	5

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

SPL(24)

esProc

chenhu • 140 View • 2 Months ago

Performance optimization case course: TPCH-Q16

1. Data storage

2. General method

3. Data conversion

4. Column-wise computing

ToC