Paper of Discrete Data Set
Below is the Supplementary reading
As we said in the abstract, the goal of discrete dataset model is to solve problems of relational algebra. The body of article explains how to define data types and related operations in discrete dataset model but does not deal with why they are defined in such ways. You need more to understand the design concept of the new computing model.
Thus, we provide information about the key differences between discrete dataset model and relational algebra. Hope you can understand better about the advantage of discrete dataset model.
The SQL programming language is developed based on relational algebra. A new programming language, SPL (Structured Process Language), is created based on discrete dataset model. In the following part, we use both SQL sample program and SPL sample program to display the differences of the two computing models.
1. Ubiquitous sets
The set concept in relational algebra only refers to the sets (or tables) made up of records (the more elusive mathematical term is Relation; here we use the practical, easier to understand term Record).
The set concept in discrete dataset model is more ubiquitous. It can deal with the set that consists of any type of data, i.e. the sequence. All sets, no matter whether they are made up of records, support common operations, such as aggregation and filtering.
In relational algebra, it appears cumbersome to understand a set of simple values as a single-field data table.
Specially, discrete dataset model supports a set consisting also of sets so that the true grouping operation can be achieved.
2. Discrete records
Records in discrete dataset model is a basic data type. They can exist independent of a data table. A data table is a set consisting of records. One or more records of this data table can also form another data table (record sequence). The filtering operation, for example, generates a new data table using records in the original data table that satisfy a specific condition. Such an operation gets an edge in terms of both space use and computing performance.
Relational algebra lacks an operational data type to represent records. A single record is, in essence, a one-row data table. Data tables cannot share records. As a result, a filtering operation must copy records to form a new data table, which increases both space and time costs.
By supporting discrete records, discrete dataset model allows a record’s field value to be a record. This makes it convenient to implement foreign key joins.
3. Orderliness
Relational algebra, however, is designed based on unordered set. It does not define ordinal numbers on members and provide location operations and the mechanism of referencing the directly next/previous member. Thanks to some partial engineering improvements and optimizations in implementations, contemporary SQL becomes able to handle certain order-based computations conveniently.
In discrete dataset model, sets are ordered. It defines the concept of ordinal numbers on members and location operations to return the ordinal number of a member. The member of a set can be accessed by its ordinal number. The computing model provides the brackets symbol [] to reference the directly previous/next member in a set operation and supports position-based computations on a set.
The commonly seen order-based computations are the chronic SQL headaches, even after window functions were introduced. SPL considerably improves the situation.
Example 1: Calculate the growth rate of each of the three dates having the highest stock prices.
SQL: A series of complicated actions are needed to implement the location operation even with the help of window functions:
SELECT date, price, seq, rate
FROM (
SELECT date, price, seq,
price/LAG(price,1) OVER (ORDER BY date ASC) rate
FROM (
SELECT date, price,
ROW_NUMBER() OVER (ORDER BY price DESC) seq
FROM stock ) )
WHERE seq<=3
SPL: It is simple to implement the location operation with the support of ordered sets.
T=stock.sort(date)
P=T.ptop(3,price)
T.calc(P,price/price[-1]-1)
Example 2: Count the longest consecutively rising days for a specific stock.
SQL: Without a convenient mechanism of referencing neighboring value, we need to get the task done in a roundabout way by converting it to the complex grouping computation.
SELECT MAX(ContinuousDays)
FROM (
SELECT COUNT(*) ContinuousDays
FROM (
SELECT SUM(RisingFlag) OVER (ORDER BY date) NoRisingDays
FROM (
SELECT date,
CASE WHEN price>LAG(price) OVER (ORDER BY date) THEN 0
ELSE 1 END RisingFlag
FROM stock ) )
GROUP BY NoRisingDays )
SPL: Having the neighboring value reference mechanism and the iteration function, it is easy to implement the task in SPL in the natural way of thinking.
n=0
stock.sort(date).max(if(price>price[-1],n+1,0))
4. Discreteness and set-lization
Relational algebra defines a wealth of set-based operations. A set-based operation is one in which sets are computed as an entity, such as aggregate and grouping operations. That is the edge of SQL over Java and other high-level languages.
But relational algebra has extremely poor discreteness that cannot support discrete records. High-level languages have great discreteness.
Discrete dataset model combines both discreteness and the feature of set-lization. It supports set data type and related operations as well as computing independent members of a set or reorganizing them to form a new set. SPL absorbs the advantages of both SQL and Java.
Order-based computations are typical scenarios of the cooperation of discreteness and set-lization. The concept of order makes sense only for members of sets. Order does not exist on a single member. That is the embodiment of set-lization. An order-based computation involves a specific member and at least one of its neighboring members. That requires discreteness.
A more thorough set-lization requires discreteness. Both are essential for tackling order-based computations.
Discrete dataset model is the algebraic system embracing both discreteness and thorough set-lization while relational algebra only features superficial set-lization.
5. The essence of grouping operations
The aim of a grouping operation is to divide a set into smaller sets according to a specific rule. Since relational algebra does not have a data type to represent a set of sets, the grouping action is always followed by aggregation.
Discrete dataset model allows a set of sets to hold the representable grouping result set. The set of groups and the aggregation on groups can thus be separated into independent steps. This allows us to perform further operations on the grouped subsets.
Relational algebra supplies only the equi-grouping, which divides a set according to grouping key values. The equi-grouping operation is a complete partition.
Discrete dataset model regards all methods that try to divide a set as grouping operations. Besides the ordinary equi-grouping, the computing model also offers order-based grouping that makes use of the orderliness feature and alignment grouping that may return a result set of incomplete partition.
We can rewrite the SQL implementation in above example 2 using SPL order-based grouping operation in a simpler way:
stock.sort(date).group@i(price>price[-1]).max(~.len())
6. The essence of aggregate operations
Relational algebra lacks explicit set data type. The result of an aggregate operation is always a single value. So do the post-grouping aggregate operations, which support only a few aggregates including SUM, COUNT, MAX and MIN. As the traditionally popular computing model cannot regard TOPN operation as a kind of aggregation, such a computation on a whole set gets the top N only after sort is performed during outputting the result set. Implementing a TOPN operation on grouped subsets is particularly hard. It needs to get it done in a roundabout way by attaching ordinal numbers to members.
Discrete dataset model supports sets in a more general meaning. Besides a single value, the aggregate result may also be a set. Like SUM and COUNT, in discrete dataset model, TOPN operation can be performed on both a single whole set and the grouped subsets.
Example 3: Select the three stocks having the highest closing prices on each date.
SQL: First index records and then select the eligible ones.
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY price DESC PARTITION BY date) seq
FROM stock )
WHERE seq<=3
SPL: Take getting TOPN as an aggregate operation and directly perform grouping and aggregation.
stock.groups(date;top(3;-price):top3).conj(top3)
After understanding TOPN as an aggregate operation, SPL avoids a full set sorting during practical implementation and, as a result, achieves high performance. The SQL’s TOPN solution is always accompanied with the ORDER BY action. It requires a full sorting in theory to get implemented and expects database software to do the engineering optimization in practice.
7. The essence of join operations
Relational algebra defines a join operation as the filtered Cartesian product and specifies no rules for the filtering condition. Theoretically, an operation is regarded as a join if its result is a subset of the Cartesian product.
Discrete dataset model divides join operations into three types:
1)Foreign key joins
Associate certain fields of table A to the primary key of table B. We call table B the foreign key table of table A, and table A the fact table.
2) Homo-dimension joins
Table A and table B are associated through their primary keys. Each table is the other’s homo-dimension table.
3) Primary-sub joins
Associate table A’s primary key to certain fields of table B’s primary key. Table A is table B’s primary table, and table B is table A’s subtable.
We can further combine the homo-dimension joins and primary-subtable joins into a larger category named primary key joins.
Features of discrete dataset model join operations are as follows:
1)It defines only equi-joins (through the condition that joining field values are equal);
2) It bases each join on the primary key of a certain table.
In real-world situations, most joins are equi-joins and almost all joins having business meaning involve the primary key. Error generally happens if a join does not involve the primary key (logically) because a many-to-many relationship will be generated.
Discrete dataset model also designs operations of filtered Cartesian product to handle a few cases that the above definition is unable to cover. They are not the common join operations, however.
Advantages are apparent after discrete dataset model remolds the join operation.
The differentiation of foreign key joins and primary key joins helps observe the structure of relationship between data tables more clearly. A foreign key table is usually for detailing information of certain fields and can be ignored when we focus on the fact table. Though both foreign key joins and primary-subtable joins could result in one-to-many association, they interpret a business analytics scenario from different perspectives.
The definition excludes many-to-many associations, getting rid of the corresponding wrong logics (which usually drive database to crash) resulted from the accidental omission of a certain join condition in handling multi-table joins.
By supporting discrete records, SPL allows assigning corresponding records in the foreign key table to a field in the fact table. Thus, the code becomes intuitive and error is avoided.
Example 4: Find the US employees whose managers are Chinese.
SQL: Association is not obvious and a table needs different aliases for two joins.
SELECT A.* FROM employee A
JOIN department B ON A.department=B.id
JOIN employee C ON B.manager=C.id
WHERE A.nation='US' AND C.nation='China'
SPL: Intuitive joining relationship once the foreign key association is established.
employee.switch(department,department)
department.switch(manager,employee)
// The above actions can be done at data loading; the actual query is just one line
employee.select(nation=="US" && department.manager.nation="China")
The definition of discrete dataset model joins brings in huge advantages in performance optimization, which we will talk about later.
The definition of relational algebra joins is simple, enables broad coverage yet lacks certain key features. The absence disables the popular computing model to simplify code and optimize performance using these features.
8. External storage computations
As a pure mathematical theoretical system, relational algebra does not take the differences of in-memory computations and out-of-memory computations into account.
Discrete dataset model, which also embraces the practical implementations, offers special design to achieve external storage computations.
Generally, data stored in a non-specific external storage device can only be accessed in order rather than randomly. Discrete dataset model abstracts the cursor object to handle it. It puts forward the cursor derivation frame and uses it to design cursor-related operations in similar ways to corresponding table-sequence-related operations. This increases the transparency of external storage computations as much as possible. The computing logic based on cursor operation can be easily applied to data outside memory.
Discrete dataset model also has composite tables that allow limited random accesses for external storage. Composite tables are specifically designed according to features of hard disks, the most common external storage device. The computing logic based on composite-table-related operation can be easily and efficiently implemented on hard disks.
The attached table mechanism allows a composite table to store multiple associated data tables. Generally, the primary key association is determined when the data structure is designed. People do not need to specify tables and joining fields ad hoc during the computing process. Storing tables on which a primary key join is to be performed in a combined way is equivalent to a pre-join. It helps to reduce storage and the computational complexity to perform a join and thus get better performance. The design is the result of a deep and unique understanding of join operations in discrete dataset model. The pre-join, however, is not suitable for foreign-key joins.
9. Order-driven high-performance
Discrete dataset model puts special emphasis on the order of a set because it implements many high-performance algorithms by making use of the orderliness. The unordered-set-based relational algebra cannot do this but turn to engineering optimization.
Here we list some simple computing scenarios where orderliness plays a key role.
1)
A composite table is ordered by the key. It means that a ready-made index already exists. This helps locate records quickly for a filtering on the key field and reduce the cost of traversal on external storage data. We can use binary search to locate and retrieve data randomly by key values and reuse the index to retrieve data by multiple key values.
2)
Generally, we use HASH algorithm to achieve grouping operations. If we are sure that data is ordered by the grouping key, we can perform neighboring values comparison only without calculating HASH values. This gets rid of HASH collision and makes it easy to implement parallel processing.
3)
A composite table is ordered by the key. The high-performance merge algorithm can be used to implement primary key join operation on two large composite tables. It traverses data only once without buffering data. This results in small memory usage. The conventional HASH partition method is complex and requires large memory space and external buffering. Moreover, a secondary HASH calculation and buffering is sometime needed if an inappropriate HASH function is chosen.
A composite table ordered by the key can be segmented according to the quantiles to facilitate parallel processing. The HASH algorithm, on the contrary, is much harder to increase the degree of parallelism due to too large consumption of the memory.
If one of the composite tables involved in a join becomes smaller after being filtered, it can make use of the ordered key to quickly locate target data in another composite table. A full table traversal is avoided.
4)
In another scenario, a large composite table acts as the foreign key table in a join. When there is a small fact table, we can quickly locate the matching data in the ordered foreign key table according to the join key values without the need of HASH partition action. When there is a large fact table, we can divide the foreign key table into logical segments according to the quantiles and then partition the fact table based on the logical segments. This way only one large table needs to be partitioned and a probable secondary partition due to the HASH partition method is avoided. The computing progress becomes much simpler.
Both 3)and 4) make use of the discrete dataset model’s remolding to join operations to implement the low-complexity algorithms that are hard to be implemented under the relational algebra definition (which may result in many-to-many associations).
Afterword
Those are the key differences between discrete dataset model and relational algebra. Discrete dataset model also has advantages for engineering implementation. These include providing conveniences to parallel processing, increasing foreign key join performance through large-memory-based pre-association, and offering certain techniques to facilitate practical applications, including the way to order data physically and to store data for randomly-segmented-based parallel processing. There are more to list. Though we have skipped these features that are not so theoretical but according to what we have briefly explained, differences of the two algebraic systems are already apparent and clear.
Now SPL has been successfully implemented and is available. You can find more and detailed information about SPL, including the theory behind it, the practical applications, and comparisons with SQL on relational databases in terms of agile computing and performance optimizations, in RaqForum .
This article focuses on OLAP analytics, or data computing, and does not involve OLTP tasks. More precisely, discrete dataset theory is better at supporting data warehouses. Yet, even as the theoretical base of OLTP analysis, relational algebra has lots of problems. It does not support heterogeneous data structures, costs too much in implementing transaction consistency, and has other troubles. We are trying to tackle these relational algebra OLTP difficulties and hope to put forward our results later.
Nearly half a century ago when relational algebra made its debut, the application requirements and computer hardware environment were a world of difference. It is normal for the then compatible computing model to feel rather awkward in our era. In this sense discrete dataset model can be considered as the ancient system’s contemporary advanced version, though not in the completely upwardly compatible fashion.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL