Generate TPCH data using SPL
The TPC official website provides a C language generation program, which can be downloaded, compiled, and executed to generate TPCH data.
However, many people are not familiar with the C language and its environment, and are stuck at this step.
According to official rules, a SPL script has been written here to generate TPCH data. This way, as long as esProc is installed, data can be easily generated.
TPCH has 8 tables, with 5 fixed records in the region table and 25 fixed records in the nation table. The remaining 6 tables are generated based on the Scale Factor (SF). When SF=1 (i.e. 1G), the number of records in each table is as follows:
Supplier table: 10000 pieces;
Customer table: 150000 entries;
Part table: 200000 entries;
Part Supplier Table: 800000 pieces;
Orders table: 1.5 million items;
Lineitem table: 6 million entries.
When SF=n, the scale of each table expands to n times.
When generating data, different fields have different generation rules:
The primary key field is a self-increasing integer starting from 1, such as N_NATIONKEY, S_SUPPKEY, etc.;
The name field consists of a fixed prefix and a primary key sequence number, such as Supplier#000000001, Customer#000000001;
Some numerical fields are randomly generated within a certain range, such as O_TOTALPRICE(857555285), L_TAX(00.08);
There are also some enumeration fields that are randomly selected among several state values, such as L_RETURNFLAG (values A/N/R), L_LINESTATUS (values O/F);
The order date is a random day between 1992 and 1998.
In addition to the simple rules mentioned above, there are also rules that relate different table fields to each other:
The ‘partsupp’ table contains four times the data of the ‘part’ table, with each part randomly corresponding to four suppliers;
One order (one orders record) corresponds to the L_LINENUMBER (1-7) details in the lineitem table, with probabilities of [1,0.9, 0.8, 0.5, 0.4, 0.3, 0.1] for each of the 7 details. Overall, the number of items in the lineitem table is almost four times that of the orders table.
The L_SHIPDATE and L_COMMITDATE in the lineitem table should be randomly generated within 50 days after the relevant order date in the orders table, while the L_RECEIPTDATE should be randomly generated within 10 days after the L_SHIPDATE.
The above rules have been implemented in generateTpchData.splx, and you can ‘Download Here’ to understand the detailed SPL script.
When executing this script, you can modify the value of SF in cell A1 to generate as many GB of data as needed; The cell B1 specifies the folder directory for generated data, which defaults to the same directory as the script file generateTpchData.splx, or can be changed to other absolute path (e.g.>tpchFolder=“D:/tpch/”); When withTitle is true in C1, the first row in the generated file is the column name. Otherwise, there are no column names in the file, only data:
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese Version