"Ⅰ SQL code and analysis Below is the SQL query statement: select sum(l_extendedprice) / 7.0 as a .."

Jerry RaqForum 14 No.
367 View • 1 Years ago

Performance Optimization Exercises Using TPC-H – Q17

Ⅰ SQL code and analysis

Below is the SQL query statement:

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_quantity < (
        select
            0.2 * avg(l_quantity) 
        from
            lineitem
        where
            l_partkey = p_partkey
    );

Such an association with the primary table according to the equivalence condition in the subquery can be always converted to a JOIN operation, and the SQL statement can be rewritten as:

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part,
    (select 
        l_partkey lp, 
        0.2*avg(l_quantity) lq
    from 
        lineitem
    group by 
        l_partkey
    ) lpq
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_partkey = lp
    and l_quantity < lq;

The subquery is a grouping & aggregation operation and the main query is an aggregate operation on the result set of multi-table association.

Ⅱ SPL solution

	A
1	=now()
2	>brand="Brand#33"
3	>container="LG DRUM"
4	=file("part.ctx").open().cursor@m(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY)
5	=file("lineitem.ctx").open()
6	=A5.cursor@m(L_PARTKEY,L_QUANTITY;A4.find(L_PARTKEY))
7	=A6.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(avg/=5).keys@im(L_PARTKEY)
8	=A5.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A7)
9	=A8.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0
10	=interval@ms(A1,now())

Ⅲ Further optimization

1. Optimization method

In this example, we will use the string-integer conversion method explained in Q1 – part table’s P_BRAND field is already converted in Q16 and its P_CONTAINER field is waiting to be converted, and the dimension table primary key numberization method explained in Q2 – part table’s P_PARTKEY field is already converted in Q16 and lineitem table’s L_PARTKEY field has been converted in the previous example.

2. Code for data conversion

2.1 Conversion on lineitem table

Copy lineitem_15.ctx and rename it lineitem_17.ctx.

2.2 Conversion on part table

	A
1	=file("part_16.ctx").open().cursor().fetch()
2	=A1.id(P_CONTAINER).sort()
3	=file("p_container.txt").export(A2)
4	=A1.run(P_CONTAINER=A2.pos@b(P_ CONTAINER))
5	=file("part_17.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)
6	>A5.append(A4.cursor())

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:

	A
1	>env(p_brand,file("p_brand.txt").import@si())
2	>env(p_container,file("p_container.txt").import@si())
3	>env(part, file("part_17.ctx").open().import())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:

	A
1	=now()
2	>brand=p_brand.pos@b("Brand#33")
3	>container=p_container.pos@b("LG DRUM")
4	=part.@m(P_BRAND==brand && P_CONTAINER==container)
5	=file("lineitem_17.ctx").open()
6	=A5.cursor@m(L_PARTKEY,L_QUANTITY;A4(L_PARTKEY))
7	=A6.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(A4(L_PARTKEY)=avg*0.2)
8	=A5.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A4:#)
9	=A8.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0
10	=interval@ms(A1,now())

Ⅳ Using enterprise edition’s column-wise computation

1. Original data

	A
1	=now()
2	>brand="Brand#33"
3	>container="LG DRUM"
4	=file("part.ctx").open().cursor@mv(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY)
5	=file("lineitem.ctx").open()
6	=A5.cursor@mv(L_PARTKEY,L_QUANTITY;A4.find(L_PARTKEY))
7	=A6.groups@u(L_PARTKEY;avg(L_QUANTITY)*0.2:avg).keys@im(L_PARTKEY)
8	=A5.cursor@mv(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A7)
9	=A8.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0
10	=interval@ms(A1,now())

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:

	A
1	>env(p_brand,file("p_brand.txt").import@si())
2	>env(p_container,file("p_container.txt").import@si())
3	>env(part, file("part_17.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:

	A
1	=now()
2	>brand=p_brand.pos@b("Brand#33")
3	>container=p_container.pos@b("LG DRUM")
4	=part.@m(P_BRAND==brand && P_CONTAINER==container)
5	=file("lineitem_17.ctx").open()
6	=A5.cursor@mv(L_PARTKEY,L_QUANTITY;A4(L_PARTKEY))
7	=A6.groups@u(L_PARTKEY;avg(L_QUANTITY):avg)
8	clear A4
9	=part.(null)
10	=A7.run@m(A9(L_PARTKEY)=avg*0.2)
11	=A5.cursor@mv(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A9:#)
12	=A11.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0
13	=interval@ms(A1,now())

Ⅴ Test result

Unit: Second

	Regular	Column-wise
Before optimization	15.4	7.7
After optimization	14.8	6.8

In this example, there are only over 600,000 rows of data in the filtered lineitem table. The data size is too small to give full play to the numberization method, and the optimization shows little effect. If we modify the filtering condition in A4 by changing && to ||, there are over 38 million rows left in the filtered lineitem table. In this case, the test shows a significant optimization effect.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Jerry • 367 View • 1 Years ago