Performance optimization case course: TPCH-Q11

 

select
    ps_partkey,
    sum(ps_supplycost * ps_availqty) as value
from
    partsupp,
    supplier,
    nation
where
    ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'CHINA'
group by
  ps_partkey
having
  sum(ps_supplycost * ps_availqty) > (
      select
          sum(ps_supplycost * ps_availqty) * 0.000001
      from
          partsupp,
          supplier,
          nation
      where
          ps_suppkey = s_suppkey
          and s_nationkey = n_nationkey
          and n_name = 'CHINA'
    )
order by
    value desc;

If we regard the following sub-query as a view V,

select
    ps_partkey,
    sum(ps_supplycost * ps_availqty) as value
from
    partsupp,
    supplier,
    nation
where
    ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'CHINA'
group by
ps_partkey

then the original query statement is equivalent to:

select
    ps_partkey,
    value
from V
where value>0.000001*(select sum(value) from V)

Since the view V here is already agrouped result set, which contains a relatively small number of records, traversing V is much less computation-intensive than traversing partsupp table directly.

1. Data storage

Store partsupp in order by the primary key ps_partkey. We can adopt order-based grouping method to improve the computing performance of grouping and aggregation.

For other tables, there is no special sorting requirement, they are all stored in order by primary key.

Continue to use partsupp.ctx, supplier.ctx and nation.btx from Q2.

Copy these tables to the main directory of this query.

2. General method


A

1

=now()

2

>name="CHINA"

3

>percent=0.000001

4

=file("nation.btx").import@b().select@1(N_NAME== name).N_NATIONKEY

5

=file("supplier.ctx").open().cursor@m(S_SUPPKEY;S_NATIONKEY==A4).fetch().keys@i(S_SUPPKEY)

6

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_AVAILQTY,PS_SUPPLYCOST;A5.find(PS_SUPPKEY))

7

=A6.groups@o(PS_PARTKEY;sum(PS_SUPPLYCOST*PS_AVAILQTY):value)

8

=A7.sum(value)*percent

9

=A7.select(value>A8).sort@z(value)

10

=interval@ms(A1,now())

Here we use the optimization methods mentioned in previous articles, such as pre-cursor filtering, multi-thread parallel computing.

The partsupp table is ordered by the primary key PS_PATKEY, so we can use the order-based grouping strategy to perform grouping operation on this field in order to increase performance of computing view V. A7 uses groups@o to perform the order-based grouping, which amounts to computing view V. Then A8 and A9 traverse A7 twice to get the final result.

Test result:

Test items

Execution time (seconds)

General method

7

3. Data conversion

Utilize the conversion method mentioned in previous article: convert primary key of dimension table to sequence number.

We can directly use partsupp_2.ctx, supplier_2.ctx and nation_2.btx converted in Q2.

Copy these tables to the main directory of this query.

Calculation code:


A

B

1

=now()


2

>name="CHINA"

>percent=0.000001

3

=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME)

4

=A3.(N_NAME== name)


5

=file("supplier_2.ctx").open()

=A5.cursor@m().skip().(false)

6

=file("supplier_2.ctx").open().cursor@m(S_SUPPKEY;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=true)

7

=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_AVAILQTY,PS_SUPPLYCOST;B5(PS_SUPPKEY))

8

=A7.groups@o(PS_PARTKEY;sum(PS_SUPPLYCOST*PS_AVAILQTY):value)

9

=A8.sum(value)*percent


10

=A8.select(value>A9).sort@z(value)


11

=interval@ms(A1,now())


A4 and B5 are aligned sequence.

Test result:

Test items

Execution time (seconds)

General method

7

Data conversion

4

4. Column-wise computing


A

B

1

=now()


2

>name="CHINA"

>percent=0.000001

3

=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME)

4

=A3.(N_NAME== name)


5

=file("supplier_2.ctx").open()

=A5.cursor@m().skip().(false)

6

=file("supplier_2.ctx").open().cursor@mv(S_SUPPKEY;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=true)

7

=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_AVAILQTY,PS_SUPPLYCOST;B5(PS_SUPPKEY))

8

=A7.groups@o(PS_PARTKEY;sum(PS_SUPPLYCOST*PS_AVAILQTY):value)

9

=A8.sum(value)*percent


10

=A8.select(value>A9).sort@z(value)


11

=interval@ms(A1,now())


Test result:

Test items

Execution time (seconds)

General method

7

Data conversion

4

Column-wise computing

2