How to Group Huge Volume of Data Fast
How to Group Huge Volume of Data Fast
Key words: Slow database grouping queries Big data Storage format
Grouping queries are all too familiar in database querying. You have to do a full traversal no matter how you arrange or index data. As an index is useless, storage format becomes the key factor that affects the traversal speed. Though databases use the binary storage format, their IO performance is not even satisfactory. Traversal of database data is fast but retrieving external data is very slow.
Take Oracle for an example. The source data looks like this:
ORDERID CLIENT SELLERID AMOUNT ORDERDATE NOTE
1 287 47 5825 2013-05-31 gafcaghafdgie f ci…
2 89 22 8681 2013-05-04 gafcaghafdgie f ci…
3 47 67 7702 2009-11-22 gafcaghafdgie f ci…
4 76 85 8717 2011-12-13 gafcaghafdgie f ci…
5 307 81 8003 2008-06-01 gafcaghafdgie f ci…
6 366 39 6948 2009-09-25 gafcaghafdgie f ci…
7 295 8 1419 2013-11-11 gafcaghafdgie f ci…
8 496 35 6018 2011-02-18 gafcaghafdgie f ci…
9 273 37 9255 2011-05-04 gafcaghafdgie f ci…
10 212 0 2155 2009-03-22 gafcaghafdgie f ci…
…
The actual data volume (25G) exceeds Oracle’s available maximum memory capacity (12G). To perform group & aggregation with Oracle’s Parallel Execution, the SQL query is:
select /*+ Parallel(8) */ client,sellerid,count(orderid),sum(amount) from orders group by client,sellerid
It takes 210 seconds to execute the SQL query.
Then we try handling the grouping over same data under same environment with esProc using the following script. And the execution time is about 65 seconds!
A |
|
1 |
=$(esProcOdbc) select /*+ parallel(8) */ client,sellerid,count(orderid),sum(amount) from orders.btx group by client,sellerid |
How does esProc get it done so fast? It stores data in the bin file (esProc’s own binary file format). Data storage format determines the traversal performance. Below shows features and performance ranking of different storage formats:
Storage format |
Feature |
Performance ranking |
Binary |
Smallest space usage & fastest parsing speed |
1 |
Text |
Universal in representation but bad in performance |
2 |
Database |
Binary; bad IO performance / Fast database data traversal & slow external data retrieval |
3 |
So, storage format should be the first and foremost concern in performance optimization for big data processing. To save the precious database space and increase computing performance, we can read data that used only for OLAP analysis out of the database to be stored in more efficient format. Read Performance Optimization - Traversal to learn more.
An esProc SPL script can be easily embedded into a Java program. Read How to Call an SPL Script in Java to learn details.
Read Getting Started with esProc to download and install esProc, get a license for free and find related documentation.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL