Performance Optimization Exercises Using TPC-H – Q16

 

SQL code and analysis

Below is the SQL query statement:

select
    100.00 * sum(
        case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount)
            else 0 end)
        / sum(l_extendedprice * (1 - l_discount) ) as promo_revenue
from
    lineitem,
    part
where
    l_partkey = p_partkey
    and l_shipdate >= date '1995-04-01'
    and l_shipdate < date '1995-04-01' + interval '1' month;

This is a grouping & aggregation operation on the filtered result set of multi-table association.

SPL solution

The query performs matching according to two foreign key tables part and supplier, filtering and then grouping, where the grouping operation is based on two foreign key table fields. Here we use the optimization methods previously mentioned. First we filter the foreign key tables according to the specified condition , match the primary table’s join field with the foreign key tables while converting the matched fields into pointers of records in the foreign key tables. The non-matching records will be filtered away during the process, and we can directly reference the foreign key table fields during the later grouping operation.


A

1

=now()

2

>brand="Brand#21"

3

>type="SMALL"

4

>sizes=[2,15,17,23,25,41,44,45]

5

=file("part.ctx").open().cursor@m(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=brand && !pos@h(P_TYPE,type) && sizes.contain@b(P_SIZE)).fetch().keys@im(P_PARTKEY)

6

=file("supplier.ctx").open().cursor@m(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().keys@im(S_SUPPKEY)

7

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:A5,PS_SUPPKEY:A6)

8

=A7.groups@u(PS_PARTKEY.P_BRAND,PS_PARTKEY.P_TYPE,PS_PARTKEY.P_SIZE;icount(PS_SUPPKEY):supplier_cnt)

9

=A8.sort@m(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)

10

=interval@ms(A1,now())

A5 and A6 retrieve and filter the two foreign key tables respectively. A7 creates cursor while performing matching and filtering.

Note that A5 uses contain@b in the in judgement, which means the binary search will be enabled. When there are a lot of members in the set on which the in judgment works, sorting members and using binary search can reduce the number of comparisons and increase performance.

Further optimization

1. Optimization method

In this example, we will use the string-integer conversion method explained in Q1 to convert part table’s P_BRAND field, and dimension table primary key numberization method – part table’s P_PARTKEY field and supplier table’s S_SUPPKEY field have been converted in the previous examples.

2. Code for data conversion

2.1 Conversions on supplier table and partsupp table

Copy supplier_15.ctx and partsupp_11.ctx, and rename them supplier_16.ctx and partsupp_16.ctx respectively.

2.2 Conversion on part table


A

1

=file("part.ctx").open().cursor().fetch()

2

=A1.id(P_BRAND).sort()

3

=file("p_brand.txt").export(A2)

4

=A1.run(P_PARTKEY=#, P_BRAND=A2.pos@b(P_BRAND))

5

=file("part_16.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)

6

>A5.append(A4.cursor())

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(p_brand,file("p_brand.txt").import@si())

2

>env(part, file("part_16.ctx").open().import())

3

>env(supplier, file("supplier_16.ctx").open().import())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

>brand=p_brand.pos@b("Brand#21")

3

>type="SMALL"

4

>sizes=[2,15,17,23,25,41,44,45]

5

=part.@m(if(P_BRAND!=brand && sizes.contain@b(P_SIZE) && !pos@h(p_type(P_TYPE),type),~,null))

6

=supplier.@m(!like(S_COMMENT,"*Customer*Complaints*"))

7

=file("partsupp_16.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:A5:#,A6(PS_SUPPKEY))

8

=p_brand.len()+1

9

=A7.groups@u(PS_PARTKEY.P_TYPE*A8*46+PS_PARTKEY.P_SIZE*A8+PS_PARTKEY.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt)

10

=A9.new@m(p_brand(gk%A8):P_BRAND,p_type(gk\(A8*46)):P_TYPE,gk%(A8*46)\A8:P_SIZE,supplier_cnt)

11

=A10.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)

12

=interval@ms(A1,now())

A5, A6 and A7 use the alignment sequence technique explained in Q2. A9 uses the grouping key technique explained in Q1 to convert the three-field based grouping operation to a grouping operation by a single filed gk. A10 then computes the original values of the three grouping fields from gk.

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

>brand="Brand#21"

3

>type="SMALL"

4

>sizes=[2,15,17,23,25,41,44,45].i()

5

=file("part.ctx").open().cursor@mv(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=brand && sizes.contain@b(P_SIZE) && !pos@h(P_TYPE,type)).fetch().keys@im(P_PARTKEY)

6

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().keys@im(S_SUPPKEY)

7

=file("partsupp.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:A5,A6.find(PS_SUPPKEY))

8

=A7.groups@u(PS_PARTKEY.P_BRAND,PS_PARTKEY.P_TYPE,PS_PARTKEY.P_SIZE;icount(PS_SUPPKEY):supplier_cnt)

9

=A8.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)

10

=interval@ms(A1,now())

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(p_brand,file("p_brand.txt").import@si())

2

>env(part, file("part_16.ctx").open().import@v())

3

>env(supplier, file("supplier_16.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

>brand=p_brand.pos@b("Brand#21")

3

>type="SMALL"

4

>sizes=[2,15,17,23,25,41,44,45].i()

5

=part.@m(P_BRAND!=brand && sizes.contain@b(P_SIZE) && !pos@h(p_type(P_TYPE),type))

6

=supplier.(!like(S_COMMENT,"*Customer*Complaints*"))

7

=file("partsupp_16.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY;A6(PS_SUPPKEY) && A5(PS_PARTKEY))

8

=A7.derive@o(part(PS_PARTKEY):p)

9

=p_brand.len()+1

10

=A8.groups@u(p.P_TYPE*A9*46+p.P_SIZE*A9+p.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt)

11

=A10.new@m(p_brand(gk%A9):P_BRAND,p_type(gk\(A9*46)):P_TYPE,gk%(A9*46)\A9:P_SIZE,supplier_cnt)

12

=A11.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE)

13

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

8.7

2.0

After optimization

7.2

1.3