The Open-source SPL Redefines OLAP Server
OLAP, the abbreviation of Online Analytical Processing, queries and computes data and returns a result in real-time. Its application covers report viewing, data query, multidimensional analysis and all the other data computation tasks in daily business analytics that require returning results on the spot. OLAP Server is the product for meeting these requirements.
Status of OLAP Server
At present, almost all the mainstream OLAP Servers are big data platforms based on RDB or encapsulated as RDBs, which are similar to ROLAP (seldom mentioned these days) in the early days. One of the key features of them is using SQL as the query language.
Yet, the combination of RDB and SQL brings a lot of problems to OLAP Servers.
Complex reports difficulties
In fact, the lion’s share of OLAP analyses is reporting. Among the OLAP query requirements, a lot of have ready-to-use reporting query interfaces instead of free drag and drop multidimensional analyses, and over half of them are complex reports, whose typical characteristic is complicated data processing logic. Each report needs to be developed separately for data preparation. The popular way of handling this is to use complex SQL statements or stored procedures. For scenarios that databases cannot deal with (such as external data sources like files, cross-data-source computation and separation of front-end from back-end), users need to turn to Java, making the process extremely complicated.
SQL has a lot of troubles in achieving those requirements. Stored procedures have gradually lost popularity due to many shortcomings, too (un-migratable and have potential secure problems). It is inconvenient for Java to perform set-oriented operations and the high-level language cannot meet the dynamic reporting demands because of its inability to implement hot swap. In view of these, the contemporary OLAP Server gives unsatisfactory performance in achieving complex reports.
Incompetent self-service joins
Even we only look at the multidimensional analytics, the basic OLAP goals, SQL is not as qualified and widely-applicable as expected on most occasions. It can only handle single-table analytics without joins and manage to meet requirements of some fixed scenarios, but it lacks capabilities to deal with dynamic self-service analytics.
The closed system
OLAP Servers deeply relied on databases for operation. Databases lay stress on the concept of “base”. Data should be “loaded into the base” for processing, and only one database is handled at a time. You cannot process data outside the database at the same time. Though Titled Online Analytical Processing, OLAP is also responsible for performing T+0 real-time query analysis in real-world business situations. Data coming from the other sources needs to be loaded into the database through ETL before being computed, which is not real-time at all. OLAP, however, is typically expected to query real-time data in a business database and perform mixed query and analysis (T+0 query) between the real-time data (in the business database) and historical data (in the analytical database). OLAP Server cannot meet the demand. Moreover, it cannot compute data coming from many non-RDB databases.
Low performance
Even if our target is only the historical data instead of the real-time data and the computation involves one database, OLAP queries still have low performance. The issues we have include minutes delay for report queries, non-real-time responses for real-time queries, and getting stuck in the middle of multidimensional analysis. The fundamental cause lies in SQL. It is difficult for the relational algebra-based SQL to implement high-performance algorithms. Engineering optimizations on databases cannot resolve the underlying problem, and often become useless when the SQL statement is too complicated. The performance is thus still bad.
The open-source SPL redefines OLAP Server
The release of SPL makes a big difference to the awkward situation.
SPL is the abbreviation of Structured Process Language, the special programming language for computing structured data. It offers a wealth of class libraries and agile syntax to handle various complex data processing scenarios conveniently and efficiently. The language boasts computational capabilities independent of databases (or any other data sources), naturally supports diverse data sources to accomplish cross-data-source mixed computations and real-time queries on different type of sources. It also encapsulates plenty of built-in high-performance algorithms, and efficient storage scheme and parallel processing mechanism to achieve high performance.
Agile procedure coding catering for complex report queries
For handling complex data processing tasks, SPL has independent, agile syntax that supports procedure computations. Compared with SQL, SPL syntax is concise, making it more suited to handle complex report data source preparations.
Suppose we are trying to find the largest number of days when a stock rises consecutively:
SQL generates a four-level nested statement even when it uses window functions:
select max(continuousDays)-1
from (select count(*) continuousDays
from (select sum(changeSign) over(order by tradeDate) unRiseDays
from (select tradeDate,
case when closePrice>lag(closePrice) over(order by tradeDate)
then 0 else 1 end changeSign
from stock) )
group by unRiseDays)
It is much simpler to implement the logic using SPL:
A |
|
1 |
=T(“/dw/stockRecord.txt”) |
2 |
=A1.group@i(closePrice< closePrice[-1]).max(~.len()) |
SPL’s step-by-step coding makes it convenient to achieve a complex computation according to the natural way of thinking.
SPL also supplies rich class libraries to greatly simplify data processing.
It is hard to debug code in SQL. To solve this problem, SPL designs simple and easy-to-use development environment, a series of options, such as step and set breakpoint, WYSIWYG results view panel, and other functionalities.
A report will undergo a lot of additions and changes as business operates. A reporting tool can quickly make a report template, but it cannot handle complex dynamic report data source preparations. Both SQL and stored procedures and Java lack capabilities to handle them well.
Using SPL helps computerize report data source preparations. This and the reporting tools for data presentation will further automate the whole report development process, making the handling of endless report building tasks fast and cost-effective.
SPL is an interpreted execution programming language that naturally supports hot-swap. Any change to the report (data preparation process) will take into effect without restarting the application, making it convenient to build a constantly modified report.
Moreover, SPL’s agile and easy-to-switch syntax makes it integration-friendly with development frameworks such as microservices. The language has computational capabilities independent of databases so that it can process data originated from a microservice by storing the algorithm outside the database. This is better than Java hardcoding as it can effectively reduce coupling between modules in the application.
An open system
The conventional OLAP Server is a closed system while SPL-based OLAP Server is open. SPL has independent computational capabilities, and abandons the database constraints – even the concept of “base”, enabling it to compute data coming from any data sources directly, including CSV, Excel, JSON/XML, NoSQL, RestAPI, HDFS, Kafka, Elasticsearch and SAP, and handle mixed computations. The data source can come from a local application system or an external system or a remote cloud application.
It is convenient to achieve T+0 real-time data query using this open computing system. An application only needs to connect to the business database storing the hot data and the analytical database (or a file) containing cold data to perform a mixed computation.
High performance
Rather than based on the relational algebra, SPL establishes itself on the innovative algebra of discrete data set. This enables the language to achieve high-performance algorithms and storage schemes, which SQL finds hard to implement and which are key factors for performance enhancement, effortlessly.
For example, SPL is thoroughly set-oriented. It can treat TOPN as a kind of aggregate operation so that the highly complicated sorting can be converted into an aggregate operation of low complexity. And this makes it more widely-applicable.
A |
||
1 |
=file(“data.ctx”).create().cursor() |
|
2 |
=A1.groups(;top(10,amount)) |
Get orders whose amounts rank in top 10 |
3 |
=A1.groups(area;top(10,amount)) |
Get orders whose amounts rank in top 10 in each area |
SQL needs a full sorting when coding the above computation, leading to extremely low performance and deep depending on database optimizations. The database optimizer, however, becomes invalid once the computation becomes even a little complicated (such as A3 that also involves a grouping operation).
Here is another example. SPL supports multi-purpose cursor that generates multiple result sets during one traversal.
A |
||
1 |
=file(“order.ctx”).create().cursor() |
Get ready for traversal |
2 |
=channel(A1).groups(product;count(1):N) |
Configure multi-purpose computations |
3 |
=A1.groups(area;sum(amount):amount) |
Traverse the source table and obtain the grouping result sets |
4 |
=A2.result() |
Get result of A2’s computation |
SQL is unable to code the algorithm because its implementation involves multiple traversals of the large source table and thus cause low performance. The problem is only theoretical and database optimizer is helpless.
SPL also supplies a set of other technologies for optimizing OLAP analytics. They include order-based merge for achieving a join between orders and order details, the pre-association technique for implementing a multilevel dimension table join in multidimensional analysis, bit-based storage strategy for analytics involving thousands of tags, boolean type sets for speeding up queries with a filtering condition that has multiple enumerated values, time-interval-based grouping for achieving complex funnel analyses, double increment segmentation for achieving synchronized parallel processing on column-oriented stored data, to name a few. Many of them are unique SPL innovations.
According to tests based on international standard TPC-H, SPL runs several times faster on low-performance ARM chip than Oracle run on high-performance Intel chip. This is the strength of the SPL’s innovative algorithms.
With the support of high-performance algorithms and storage schemes, SPL achieves higher performance in computing historical big data, and increases efficiency of T+0 queries involving also the real-time hot data through mixed computations.
JOIN queries
To solve the multidimensional analysis’s inadequate joining ability problem that the conventional OLAP Server has, a SPL-based join query syntax DQL was created. The Dimensional Query Language is a SQL-like query language centered around dimensions. It adopts a different approach in handling table joins.
The SQL-based OLAP Server has only limited unsatisfactory approaches to deal with multi-table joins. One approach is using a logical wide table. It is hardly successful because it will have too many fields (generated from repeated copies of dimension fields, multilevel joins, self-joins and loop joins) and, even if it is actually used, has poor performance. Some BI products allow users to perform auto-joins on selected fields, but this applies only to simple scenarios. When a table contains homo-dimension fields (for instance, a table has more than two region fields), matching will fail, and the self-join cannot be implemented. Letting users be able to access all tables and fields and do any joins themselves are also infeasible.
DQL uses a different way to achieve a join query. Below is SQL statement:
--SQL
SELECT A.* FROM EMPLOYEE A, DEPARTMENT B, EMPLOYEE C
WHERE A.country='USA'
AND C. country ='China'
AND A. dept_id =B. dept_id
AND B. manager=C. emp_id
It involves multiple tables and self-joins. It is hard for business users to correctly express the associative relationships on a BI interface.
DQL codes the same query in the following way:
--DQL
SELECT * FROM EMPLOYEE
WHERE country ='USA' AND dept_id.manager.country ='China’
DQL converts the complex multi-table join into a simple single-table query, making it easy for ordinary business users to understand and implement the query on the interface.
Summary
The appearance of SPL and DQL will have a profound impact on OLAP Server.
SPL’s agile functionalities (procedure computation, external algorithms and interpreted execution) suited well to complex reporting tasks in OLAP analysis, and its rapid development, hot switch and loose coupling are integration-friendly with microservices. SPL’s open computational system and highly-inclusive data organization forms give it unprecedented openness the conventional closed OLAP products cannot have, enabling it to use various data sources directly and implement T+0 queries effortlessly. The SPL-based DQL provides the best solution to multidimensional analysis’ real-time join query performance issue. SPL’s high-performance algorithms and storage schemes ensures high OLAP performance for achieving report queries, T+0 queries, multidimensional analyses, and other query and analytical tasks.
We believe it’s just a matter of time that a SPL-based new-generation of OLAP Server and BI products are finally available.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version