Performance Optimization Exercises Using TPC-H – Q5

 

SQL code and analysis

Below is the SQL query statement:

select
    n_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue
from
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and l_suppkey = s_suppkey
    and c_nationkey = s_nationkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and o_orderdate >= date '1995-01-01'
    and o_orderdate < date '1995-01-01' + interval '1' year
group by
    n_name
order by
    revenue desc;

This is a grouping & aggregation operation on the filtered result set of multi-table association that contains a primary-sub table association between orders table and lineitem table.

SPL solution

The working principle is similar to that in Q3.


A

1

=now()

2

1995-1-1

3

=elapse@y(A2,1)

4

>name="ASIA"

5

=file("region.btx").import@b().select(R_NAME==name)

6

=file("nation.btx").import@b().switch@i(N_REGIONKEY,A5:R_REGIONKEY).keys(N_NATIONKEY)

7

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY;A6.find(S_NATIONKEY)).fetch().keys@im(S_SUPPKEY)

8

=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NATIONKEY;A6.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)

9

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE < A3,O_CUSTKEY:A8)

10

=file("lineitem.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A7)

11

=A10.select(O_CUSTKEY.C_NATIONKEY==L_SUPPKEY.S_NATIONKEY)

12

=A11.groups@u(L_SUPPKEY.S_NATIONKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

13

=A12.sort@z(revenue).run(N_NAME=A6.find(N_NAME).N_NAME)

14

=interval@ms(A1,now())

Here we use the techniques of performing filtering at cursor creation, order-based primary-sub table association, and converting join field to foreign key table pointers.

Unlike Q3, the grouping field isn’t the ordered L_ORDERKEY and group@s() function becomes unsuitable. A12 performs grouping on S_NATIONKEY instead of N_NAME because integer comparison is faster than string comparison, and replace the field with N_NAME after the grouping operation is done.

Further optimization

1. Optimization method

In this example, we will use the date-integer conversion optimization method explained in Q1 – the orders table’s OR_ORDERDATE field is already converted in Q3 – and the dimension table primary key numberization – region table’s R_REGIONKEY field, nation table’s N_NATIONKEY and N_REGIONKEY fields, and supplier table’s S_SUPPKEY and S_NATIONKEY fields are already converted in Q2; and orders table’s O_CUSTKEY is already converted in Q3. Here we only need to convert customer table’s C_NATIONKEY field and lineitem table’s L_SUPPKEY field.

2. Code for data conversion

2.1 Conversion on region table, nation table and supplier table

Copy region_2.btx, nation_2.btx and supplier_2.ctx, and rename them region_5.btx, nation_5.btx and supplier_5.ctx.

2.2 Conversion on orders table

Copy orders_4.ctx and name it orders_5.ctx.

2.3 Conversion on customer table


A

1

=file("customer_3.ctx").open().import()

2

=file("nation.btx").import@b()

3

=A1.run(C_NATIONKEY=A2.pselect(N_NATIONKEY==C_NATIONKEY))

4

=file("customer_5.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT)


>A4.append(A3)

2.4 Conversion on lineitem table


A

1

=file("supplier.ctx").open().import(S_SUPPKEY).keys@im(S_SUPPKEY)

2

=file("lineitem_4.ctx").open().cursor()

3

=A2.run(L_SUPPKEY=A1.pfind(L_SUPPKEY))

4

=file("lineitem_5.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)


>A4.append(A3)

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(region, file(“region_5.btx”).import@b())

2

>env(nation, file("nation_5.btx").import@b())

3

>env(customer, file("customer_5.ctx").open().import())

4

>env(supplier, file("supplier_5.ctx").open().import())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

1995-1-1

3

=days@o(elapse@y(A2,1))

4

=days@o(A2)

5

>name="ASIA"

6

=nation.(region(N_REGIONKEY).R_NAME==name)

7

=supplier.@m(if(A6(S_NATIONKEY), S_NATIONKEY,null))

8

=customer.@m(if(A6(C_NATIONKEY), C_NATIONKEY,null))

9

=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3,O_CUSTKEY:A8:#)

10

=file("lineitem_5.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A7:#)

11

=A10.select(O_CUSTKEY ==L_SUPPKEY)

12

=A11.groups@u(L_SUPPKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

13

=A12.sort@z(revenue).run(N_NAME=nation(N_NAME).N_NAME)

14

=interval@ms(A1,now())

We use two optimization techniques in the code – the alignment sequence after dimension table primary key numberization and association with dimension table row numbers.

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

1995-1-1

3

=elapse@y(A2,1)

4

>name="ASIA"

5

=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY

6

=file("nation.btx").import@b().select(N_REGIONKEY==A5).derive@o().keys@i(N_NATIONKEY)

7

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NATIONKEY;A6.find(S_NATIONKEY)).fetch().keys@im(S_SUPPKEY)

8

=file("customer.ctx").open().cursor@mv(C_CUSTKEY,C_NATIONKEY;A6.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)

9

=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3,O_CUSTKEY:A8)

10

=file("lineitem.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A7)

11

=A10.select@v(O_CUSTKEY.C_NATIONKEY==L_SUPPKEY.S_NATIONKEY)

12

=A11.groups@u(L_SUPPKEY.S_NATIONKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

13

=A12.sort@z(revenue).new(A6.find(S_NATIONKEY).N_NAME,revenue)

14

=interval@ms(A1,now())

A11 uses @v option in select() function that performs filtering on the column-wise cursor to keep the filtered cursor still as a column-wise cursor.

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(region, file(“region_5.btx”).import@bv())

2

>env(nation, file("nation_5.btx").import@bv())

3

>env(customer, file("customer_5.ctx").open().import@v())

4

>env(supplier, file("supplier_5.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

1995-1-1

3

=days@o(elapse@y(A2,1))

4

=days@o(A2)

5

>name="ASIA"

6

=nation.(region(N_REGIONKEY).R_NAME==name)

7

=supplier.(A6(S_NATIONKEY))

8

=customer.(A6(C_NATIONKEY))

9

=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3 && A8(O_CUSTKEY))

10

=file("lineitem_5.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;A7(L_SUPPKEY))

11

=A10.derive@o(supplier(L_SUPPKEY).S_NATIONKEY,customer(O_CUSTKEY).C_NATIONKEY).select@v(C_NATIONKEY==S_NATIONKEY)

12

=A11.groups@u(S_NATIONKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

13

=A12.sort@z(revenue).new(nation(S_NATIONKEY).N_NAME,revenue)

14

=interval@ms(A1,now())

A9 uses A8(O_CUSTKEY) to only filter data at cursor creation without performing switch association. The reason is explained in Q2, and same operation for A10. A11 uses derive@o() to add two new columns and select@v to keep result as a column-wise cursor.

Test result

Unit: Second


Regular

Column-wise

Before optimization

23.5

9.8

After optimization

11.5

5.7