Performance optimization case course: TPCH-Q16

 

select
    p_brand,p_type,p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#21'
    and p_type not like 'SMALL%'
    and p_size in (2, 15, 17, 23, 25, 41, 44, 45)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size;

This is a grouping & aggregation operation on the filtered result set of multi-table association.

1. Data storage

There is no special requirement for tables partsupp, part and supplier, store them in order by primary key.

Continue to use partsupp.ctx, part.ctx and supplier.ctx from Q2 query.

Copy these tables to the main directory of this query.

2. General method

The query is to match partsupp with the two dimension tables part and supplier, filter and then group, where the grouping operation is based on two dimension table fields.

Here we can use the optimization methods mentioned in previous articles. First, we filter the dimension tables, and then perform foreign key attributization on the primary table. The records that do not matchwiththe foreign key table will be filtered out during the process, and we can directly reference the dimension table fields during the grouping operation.


A

B

1

=now()


2

="Brand#21"

="SMALL"

3

=[2,15,17,23,25,41,44,45]


4

=file("part.ctx").open().cursor@m(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=A2 && !pos@h(P_TYPE,B2) && A3.contain@b(P_SIZE)).fetch().keys@im(P_PARTKEY)

5

=file("supplier.ctx").open().cursor@m(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().keys@im(S_SUPPKEY)

6

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:A4,PS_SUPPKEY:A5)

7

=A6.groups@u(PS_PARTKEY.P_BRAND,PS_PARTKEY.P_TYPE,PS_PARTKEY.P_SIZE;icount(PS_SUPPKEY):supplier_cnt)

8

=A7.sort@m(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)

9

=interval@ms(A1,now())


A4 and A5 retrieve and filter the dimension tables respectively. A6 creates cursor while performing foreign key join and filtering.

Note that A4 uses contain@b when performing the IN judgement, which means the binary search will be adopted. When there are a lot of members in the set on which the IN judgment works, sorting members and using binary search can reduce the number of comparisons and increase performance.

Test result:

Test items

Execution time (seconds)

General method

18

3. Data conversion

For this query, we need to use the dimension table primary key sequence-numberization method. The partsupp_2.ctx and supplier_2.ctx from Q2 query can be used directly. Copy them to the main directory of this query.

In addition, we also need to convert the enumeration string field p_brand to numbers based on part_2.ctx of Q2 query.

Code for data conversion:


A

1

=file("part_2.ctx").open().cursor().fetch()

2

=A1.id(P_BRAND).sort()

3

=file("p_brand.btx").export@b(A2)

4

=A1.run(P_PARTKEY=#, P_BRAND=A2.pos@b(P_BRAND))

5

=file("part_16_3.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)

6

>A5.append(A4.cursor())

Calculation code:


A

B

1

=now()


2

Brand#21

SMALL

3

[2,15,17,23,25,41,44,45]


4

=file("p_brand.btx").import@b().(_1)

=A4.pos@b(A2)

5

=file("p_type.btx").import@b().( !pos@h(_1,B2))

6

=file("part_16_3.ctx").open()

=A6.cursor@m().skip().(null)

7

=A6.cursor@m(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=B4 && A3.contain@b(P_SIZE) && A5(P_TYPE)).fetch().(B6(P_PARTKEY)=~)

8

=file("supplier_2.ctx").open()

=A8.cursor@m().skip().(false)

9

=A8.cursor@m(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().(B8(S_SUPPKEY)=true)

10

=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:B6:#,B8(PS_SUPPKEY))

11

=A4.len()+1


12

=A10.groups@u(PS_PARTKEY.P_TYPE*A11*46+PS_PARTKEY.P_SIZE*A11+PS_PARTKEY.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt)

13

=A12.new@m(A4(gk%A11):P_BRAND,A5(gk\(A11*46)):P_TYPE,gk%(A11*46)\A11:P_SIZE,supplier_cnt)

14

=A13.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)

15

=interval@ms(A1,now())


A5, B6 and B8 use the alignment sequence technique explained in previous articles. A12 uses the grouping key technique to convert the three-field based grouping operation to the grouping operation by a single filed gk. A13 then computes the original three grouping fields using gk.

Test result:

Test items

Execution time (seconds)

General method

18

Data conversion

14

4. Column-wise computing

Calculation code:


A

B

1

=now()


2

Brand#21

SMALL

3

[2,15,17,23,25,41,44,45]


4

=file("p_brand.btx").import@b().(_1)

=A4.pos@b(A2)

5

=file("p_type.btx").import@b().( !pos@h(_1,B2))

6

=file("part_16_3.ctx").open()

=A6.cursor@m().skip().(null)

7

=A6.cursor@mv(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=B4 && A3.contain@b(P_SIZE) && A5(P_TYPE)).fetch().(B6(P_PARTKEY)=~)

8

=file("supplier_2.ctx").open()

=A8.cursor@m().skip().(false)

9

=A8.cursor@mv(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().(B8(S_SUPPKEY)=true)

10

=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:B6:#,B8(PS_SUPPKEY))

11

=A4.len()+1


12

=A10.groups@u(PS_PARTKEY.P_TYPE*A11*46+PS_PARTKEY.P_SIZE*A11+PS_PARTKEY.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt)

13

=A12.new@m(A4(gk%A11):P_BRAND,A5(gk\(A11*46)):P_TYPE,gk%(A11*46)\A11:P_SIZE,supplier_cnt)

14

=A13.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)

15

=interval@ms(A1,now())


Test result:

Test items

Execution time (seconds)

General method

18

Data conversion

14

Column-wise computing

5