Performance optimization case course: TPCH-Q12

 

select
    l_shipmode,
    sum(case
        when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1
        else 0 end) as high_line_count,
    sum(case
        when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1
        else 0 end) as low_line_count
from
    orders,
    lineitem
where
    o_orderkey = l_orderkey
    and l_shipmode in ('TRUCK', 'MAIL')
    and l_commitdate < l_receiptdate
    and l_shipdate < l_commitdate
    and l_receiptdate >= date '1995-01-01'
    and l_receiptdate < date '1995-01-01' + interval '1' year
group by
    l_shipmode
order by
    l_shipmode;

This is a grouping & aggregation operation on the filtered result set of association between primary table orders and sub table lineitem.

1. Data storage

Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining.

Continue to use orders.ctx and lineitem.ctx from Q3.

Copy these tables to the main directory of this query.

2. General method


A

1

=now()

2

1995-1-1

3

=elapse@y(A2, 1)

4

=["MAIL", "TRUCK"]

5

=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_SHIPMODE;L_RECEIPTDATE>=A2 && L_RECEIPTDATE<A3 && A4.contain(L_SHIPMODE) && L_COMMITDATE<L_RECEIPTDATE && L_SHIPDATE<L_COMMITDATE)

6

=file("orders.ctx").open().new@r(A5,L_SHIPMODE,O_ORDERPRIORITY)

7

=A6.run(O_ORDERPRIORITY=if(O_ORDERPRIORITY=="1-URGENT" || O_ORDERPRIORITY=="2-HIGH",1,0))

8

=A7.groups(L_SHIPMODE;sum(O_ORDERPRIORITY):high_line_count, sum(1-O_ORDERPRIORITY):low_line_count)

9

=interval@ms(A1,now())

Here we use the optimization methods mentioned in previous articles, such as pre-cursor filtering, multi-thread parallel computing.

The conditions for high_line_count and low_line_count are opposite.If summation operation is not involved, the latter is 0 when the former is 1, and the latter is 1 when the former is 0. Therefore,A7 and A8 use O_ORDERPRIORITY and 1-O_ORDERPRIORITY to calculate these two values beforesummation.

Test result:

Test items

Execution time (seconds)

General method

6

3. Data conversion

Utilize the conversion methods mentioned in previous articles: convert date to integer, convert enumeration string field to number.

We can directly use orders_4.ctx converted in Q4.

Copy these tables to the main directory of this query.

For lineitem table, we can convert the enumeration string field l_shipmod to number based on Q3.

Code for data conversion:


A

1

=file("lineitem.ctx").open()

2

=A1.cursor(L_SHIPMODE)

3

=A2.id(L_SHIPMODE).sort()

4

=file("l_shipmode.btx").export@b(A3)

5

=A1.cursor()

6

=A5.run(L_SHIPMODE=A3.pos@b(L_SHIPMODE),L_RECEIPTDATE=days@o(L_RECEIPTDATE),L_COMMITDATE=days@o(L_COMMITDATE),L_SHIPDATE=days@o(L_SHIPDATE))

7

=file("lineitem_3.ctx").create@py(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

8

>A7.append(A6)

Calculation code:


A

1

=now()

2

>l_shipmode=file("l_shipmode.btx").import@b().(_1)

3

>o_orderpriority=file("o_orderpriority.btx").import@b().(_1)

4

1995-1-1

5

=days@o(elapse@y(A4, 1))

6

=days@o(A4)

7

=l_shipmode.(["MAIL", "TRUCK"].contain(~))

8

=o_orderpriority.(["1-URGENT","2-HIGH"].contain(~))

9

=file("lineitem_3.ctx").open().cursor@m(L_ORDERKEY,L_SHIPMODE;L_RECEIPTDATE>=A6 && L_RECEIPTDATE<A5 && A7(L_SHIPMODE) && L_COMMITDATE<L_RECEIPTDATE && L_SHIPDATE<L_COMMITDATE)

10

=file("orders_4.ctx").open().new@r(A9,L_SHIPMODE,O_ORDERPRIORITY)

11

=A10.run(O_ORDERPRIORITY=A8(O_ORDERPRIORITY))

12

=A11.groups(L_SHIPMODE;count(O_ORDERPRIORITY):high_line_count, count(!O_ORDERPRIORITY):low_line_count)

13

=A12.run(L_SHIPMODE=l_shipmode(L_SHIPMODE))

14

=interval@ms(A1,now())

A7 and A8 are aligned sequence.

In A11, A8 (O_ORDERPRIORITY) returns true or false, corresponding to the original 1 and 0. The result of count in A12 is the same as that of the original sum, which eliminates if judgement and improves performance.

Test result:

Test items

Execution time (seconds)

General method

6

Data conversion

4

4. Column-wise computing


A

1

=now()

2

>l_shipmode=file("l_shipmode.btx").import@b().(_1)

3

>o_orderpriority=file("o_orderpriority.btx").import@b().(_1)

4

1995-1-1

5

=days@o(elapse@y(A4, 1))

6

=days@o(A4)

7

=l_shipmode.(["MAIL", "TRUCK"].contain(~))

8

=o_orderpriority.(["1-URGENT","2-HIGH"].contain(~))

9

=file("lineitem_3.ctx").open().cursor@mv(L_ORDERKEY,L_SHIPMODE;L_RECEIPTDATE>=A6 && L_RECEIPTDATE<A5 && A7(L_SHIPMODE) && L_COMMITDATE<L_RECEIPTDATE && L_SHIPDATE<L_COMMITDATE)

10

=file("orders_4.ctx").open().new@r(A9,L_SHIPMODE,O_ORDERPRIORITY)

11

=A10.derive@o(A8(O_ORDERPRIORITY):flag)

12

=A11.groups(L_SHIPMODE;count(flag):high_line_count,count(!flag):low_line_count)

13

=A12.new(l_shipmode(L_SHIPMODE):L_SHIPMODE,high_line_count,low_line_count)

14

=interval@ms(A1,now())

Since column-wise computing cannot assign value to a field, a new field flag is defined in A11.

Test result:

Test items

Execution time (seconds)

General method

6

Data conversion

4

Column-wise computing

2