Performance optimization case course: TPCH-Q15

 

create view revenue (supplier_no, total_revenue) as
    select
        l_suppkey,
        sum(l_extendedprice * (1 - l_discount))
    from
        lineitem
    where
        l_shipdate >= date '1995-04-01'
        and l_shipdate < date '1995-04-01' + interval '3' month
    group by
        l_suppkey;

select
    s_suppkey,
    s_name,
    s_address,
    s_phone,
    total_revenue
from
    supplier,
    revenue
where
    s_suppkey = supplier_no
    and total_revenue = (
        select
            max(total_revenue)
        from
            revenue
    )
order by
    s_suppkey;

drop view revenue;

This code is to generate a view revenue first, and then associate it with supplier, and finally select the record containing the maximum value.

1. Data storage

There is no special requirement for tables lineitem and supplier, they are stored in order by primary key.

We can continue to use lineitem.ctx from Q3 and supplier.ctx from Q2.

Copy these tables to the main directory of this query.

2. General method

We divide this query into two phases. The first phase generates a view revenue, and the second phase gets the record where total_revenue is the biggest from revenue. The former is a regular grouping & aggregation operation, where parallel processing is used to improve performance.

SPL offers A.maxp method that can directly return the record containing the maximum value. Since the number of records corresponding to the maximum value will not be too large, the big dimension table search method mentioned in Q10 can be adopted.


A

1

=now()

2

1995-4-1

3

=elapse@m(A2,3)

4

=file("lineitem.ctx").open().cursor@m(L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A2 && L_SHIPDATE<A3)

5

=A4.groups@u(L_SUPPKEY:supplier_no;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):total_revenue)

6

=A5.maxp@a(total_revenue)

7

=file("supplier.ctx").open()

8

=A6.joinx@q(supplier_no,A7:S_SUPPKEY,S_NAME,S_ADDRESS,S_PHONE)

9

=interval@ms(A1,now())

The code utilizes optimization methods mentioned in previous articles, such as dimension table attributization, multi-thread parallel processing.

A8 uses the big dimension table search method to implement the join between A6 and big dimension table supplier.

Test result:

Test items

Execution time (seconds)

General method

6

3. Data conversion

There is a need to convert the L_SHIPDATE in lineitem table to integer as described in previous articles, and convert the primary key of lineitem’s dimension table supplier to sequence number so as to perform sequence-number-based grouping. The sequence-numberization of dimension table primary key needs to be performed based on the lineitem.ctx in Q3 and the supplier.ctx in Q2.

In addition, it also needs to copy the supplier_2.ctx that has performed sequence-numberization in Q2 to the main directory of this query.

Code for data conversion:


A

1

=file("supplier.ctx").open().import(S_SUPPKEY).keys@im(S_SUPPKEY)

2

=file("lineitem.ctx").open().cursor()

3

=A2.run(L_SUPPKEY=A1.pfind(L_SUPPKEY),L_SHIPDATE=days@o(L_SHIPDATE))

4

=file("lineitem_15_3.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

5

>A4.append(A3)

Calculation code:


A

1

=now()

2

1995-4-1

3

=days@o(elapse@m(A2,3))

4

=days@o(A2)

5

=file("lineitem_15_3.ctx").open().cursor@m(L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A4 && L_SHIPDATE<A3)

6

=A5.groups@n(L_SUPPKEY:supplier_no;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):total_revenue)

7

=A6.maxp@a(total_revenue)

8

=file("supplier_2.ctx").open()

9

=A7.joinx@q(supplier_no,A8:S_SUPPKEY,S_NAME,S_ADDRESS,S_PHONE)

10

=interval@ms(A1,now())

The groups@n in A6 refers to the sequence-number-based grouping mentioned in the previous article.

Test result:

Test items

Execution time (seconds)

General method

6

Data conversion

5

4. Column-wise computing

Calculation code:


A

1

=now()

2

1995-4-1

3

=days@o(elapse@m(A2,3))

4

=days@o(A2)

5

=file("lineitem_15_3.ctx").open().cursor@mv(L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A4 && L_SHIPDATE<A3)

6

=A5.groups@u(L_SUPPKEY:supplier_no;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):total_revenue)

7

=A6.maxp@a(total_revenue)

8

=file("supplier_2.ctx").open()

9

=A7.joinx@q(supplier_no,A8:S_SUPPKEY,S_NAME,S_ADDRESS,S_PHONE)

10

=interval@ms(A1,now())

Test result:

Test items

Execution time (seconds)

General method

6

Data conversion

5

Column-wise computing

2