Performance optimization case course: TPCH-Q9

 

select
    nation,
    o_year,
    sum(amount) as sum_profit
from
    (
        Select
            n_name as nation,
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
            part,
            supplier,
            lineitem,
            partsupp,
            orders,
            nation
        where
            s_suppkey = l_suppkey
            and ps_suppkey = l_suppkey
            and ps_partkey = l_partkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and s_nationkey = n_nationkey
            and p_name like '%chocolate%'
    ) profit
group by
    nation,
    o_year
order by
    nation,
o_year desc;

This is a grouping and aggregation operation on the filtered result set of multi-table association, which involves a primary-sub table association between orders table and lineitem table and an association with the foreign key table partsupp having a two-field primary key.

1. Data storage

Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.

So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx from Q3.

For tables part, supplier and partsupp, we can directly use part.ctx, supplier.ctx and partsupp.ctx from Q2.

For tables region and nation, we can directly use region.btx and nation.btx from Q2.

Copy these tables to the main directory of this query.

2. General method


A

1

=now()

2

>name="chocolate"

3

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

4

=file("part.ctx").open().cursor@m(P_PARTKEY;pos(P_NAME, name)).fetch().keys@im(P_PARTKEY)

5

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY).fetch().keys@im(S_SUPPKEY)

6

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A4.find(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

7

=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A4.find(L_PARTKEY))

8

=file("orders.ctx").open().new@r(A7,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

9

=A8.join@i(L_PARTKEY:L_SUPPKEY,A6,PS_SUPPLYCOST).switch(L_SUPPKEY,A5)

10

=A9.groups(L_SUPPKEY.S_NATIONKEY:nationName,year(O_ORDERDATE):o_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit)

11

=A10.run(nationName=A3.find(nationName).N_NAME).sort(nationName,-o_year)

12

=interval@ms(A1,now())

There is no filtering on orders table. The filtering action only happens on lineitem table. The number of data rows after filtering is much less than that in orders table, so we handle the primary-sub table association differently from the previous example. First, A7 creates a lineitem table cursor. Then A8 filters the primary table orders according to the cursor. Then, the number of data rows retrieved is considerably decreased and efficiency is increased.

A9 uses join@i method to join the foreign key table’s PS_SUPPLYCOSTA field to the cursor. As the foreign key table has a two-field primary key, the method of matching and conversion at cursor creation becomes inapplicable.

A10 performs grouping on NATIONKEY instead of the string NAME, and A11 converts the grouping result to string.

Test items

Execution time (seconds)

General method

37

3. Data conversion

Utilize the conversion methods mentioned in previous articles: convert primary key of dimension table to sequence number, convert date to integer.

For tables region, nation, supplier and part, we can directly use region_2.btx, nation_2.btx, supplier_2.ctx and part_2.ctx converted in Q2.

To convert the enumeration string field type in the part table to number, we can use p_type.btx from Q2.

For the customer table, we can use customer_3.ctx from Q5.

For the orders table, we can use orders_5.ctx from Q3.

For the lineitem table, we can use lineitem_4.ctx from Q8.

Copy these tables to the main directory of this query.

Calculation code:


A

B

1

=now()


2

>name="chocolate"


3

=file("nation_2.btx").import@b()


4

= file("part_2.ctx").open()

=A4.cursor@m().skip().(false)

5

=A4.cursor@m(P_PARTKEY;pos(P_NAME, name)).fetch().(B4(P_PARTKEY)=true)

6

=file("supplier_2.ctx").open()


7

=A6.cursor@m(S_NATIONKEY).fetch().(S_NATIONKEY)

8

=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;B4(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

9

=file("lineitem_4.ctx").open().cursor@m(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;B4(L_PARTKEY))

10

=file("orders_5.ctx").open().new@r(A9,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

11

=A10.select(L_PARTKEY=A8.find(L_PARTKEY,L_SUPPKEY).PS_SUPPLYCOST).switch(L_SUPPKEY,A7:#)

12

=A3.len()+1


13

=A11.groups(year(O_ORDERDATE)*A12+L_SUPPKEY:gk;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-L_PARTKEY*L_QUANTITY):profit)

14

=A13.new(A3(gk%A12).N_NAME:nationName,gk\A12:o_year,profit).sort(nationName,-o_year)

15

=interval@ms(A1,now())


A11 searches A5 for target records according to L_PARTKEY and L_SUPPKEY, and assigns L_PARTKEY as PS_SUPPLYCOST values – during which the current PS_SUPPLYCOST value is null if no corresponding A5 record is found and select() function will filter away the current record. This method has equal performance with the previously explained join@i method.

A13 uses the grouping key technique explained in Q1 to convert the grouping operation by two fields into one by a single field gk.

Test result:

Test items

Execution time (seconds)

General method

37

Data conversion

29

4. Column-wise computing


A

B

1

=now()


2

>name="chocolate"


3

=file("nation_2.btx").import@b()


4

=file("part_2.ctx").open()

=A4.cursor@m().skip().(false)

5

=A4.cursor@mv(P_PARTKEY;pos(P_NAME, name)).fetch().(B4(P_PARTKEY)=true)

6

=file("supplier_2.ctx").open()


7

=A6.cursor@m(S_NATIONKEY).fetch()

8

=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;B4(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

9

=file("lineitem_4.ctx").open().cursor@mv(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;B4(L_PARTKEY))

10

=file("orders_5.ctx").open().new@r(A9,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

11

=A10.join@i(L_PARTKEY:L_SUPPKEY,A8,PS_SUPPLYCOST;L_SUPPKEY,A7:#,S_NATIONKEY)

12

=A3.len()+1


13

=A11.groups(year(O_ORDERDATE)*A12+S_NATIONKEY:gk;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit)

14

=A13.new(A3(gk%A12).N_NAME:nationName,gk\A12:o_year,profit).sort(nationName,-o_year)

15

=interval@ms(A1,now())



=now()


For column-wise computing, try to avoid using switch() function. Therefore, instead of using switch to assign value to the existing column, A11 uses join to concatenate new column.

Test result:

Test items

Execution time (seconds)

General method

37

Data conversion

29

Column-wise computing

12