SPL practice: improve concurrency through route calculation

 

Problem description

System architecture

The frontend application is a BI system that mainly performs queries based on filter conditions; the backend is an MPP database. Based on the user's selections and operations on the page, the frontend generates SQL against the wide table 'orders' (order table), submits the SQL to the backend, and displays the returned result.

The generated SQL is roughly as follows:

select top 1000 ORDERID,CUSTOMERNAME,EMPLOYEEID,ORDERDATE,AMOUNT
  from orders
  where ORDERDATE=date('2023-11-22') and AMOUNT>100
  and (CUSTOMERNAME='GIUPK' or EMPLOYEEID=87041)

In the where condition, the date is a required field (only one day's data is queried at a time); the other fields and filter conditions can be freely combined; each query returns at most 1000 records.

Data structure and data scale

Data structure of the ‘orders’:

Field name Field type Field meaning Sample data
ORDERID String Order number 10777
CUSTOMERNAME String Customer name ERNSH
EMPLOYEEID String Employee number 7
ORDERDATE Date Date 2023-10-16
AMOUNT Number Amount 96.5

This table stores years of order data to date, with a daily data volume of 30 million rows.

Condition and expectation

Because the backend database also has to undertake other computing tasks such as batch jobs, it is under heavy workload and can only provide 5 connections to the frontend. However, the frontend is used by hundreds of users, so responses become very slow when many users access it simultaneously.

Statistics show that over 90% of the requests from the frontend are targeted at hot data, that is, the data from the last three days.

The way to meet this expectation is to store the hot data outside the database (for example, locally in the frontend application) and use the date condition to determine whether the query targets the last three days' data. If so, access the hot data; if not, access the database as before.

Problem analysis

A common method to solve this problem is to deploy a frontend database. Although this solves the problem of storing hot data, it faces significant challenges in terms of computation.

When the frontend submits a query request, it needs to determine the location of the data to be queried (frontend database or backend database) based on the date condition. Such calculation is actually to select the path of data, which can be referred to as route calculation.
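The routing decision itself is a small piece of logic. Below is a hedged Python sketch of the rule (the actual system implements it in SPL; the three-day threshold and the source names "local"/"backend" are illustrative):

```python
from datetime import date

HOT_DAYS = 3  # queries within the last 3 days hit the local hot data

def route(order_date: date, today: date) -> str:
    """Select the data path for a query on order_date: local hot data or backend database."""
    age = (today - order_date).days
    return "local" if 0 <= age <= HOT_DAYS else "backend"
```

For example, with the assumed current date of November 24, 2023, a query on November 22 routes to the local hot data, while a query on November 1 goes to the backend database.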

However, a database is closed in its computing ability and can only compute data stored inside itself, so it is difficult for it to apply the route calculation rules, forward the query, merge results, and so on.

Moreover, the frontend BI system is a packaged software suite that cannot be modified, making it impossible for the frontend to automatically select different databases based on the query condition. Users would have to select manually, which results in a very poor user experience.

Programmable route calculation

In practice, the route calculation rules may be complex and varied, which makes them very difficult to implement through configuration alone. The best method for such calculation is programming.

SPL is a programming language specifically designed for computing structured and semi-structured data. It offers high programming efficiency, allowing us to easily implement flow control such as conditional judgment, branching and looping. Moreover, SPL supports code hot swapping, making it the simplest and most efficient way to implement complex route calculation rules.

Unified interface

When calling esProc through the JDBC driver, we can configure an SPL script file to serve as a JDBC gateway, letting this file process and execute all SQL statements submitted through JDBC. This provides a unified entry point for the SQL statements submitted from the frontend, so we can conveniently write the route calculation rules in this SPL script, or write them in another script and have the gateway call it.

It does not matter if what the frontend submits is not SQL: SPL also supports JDBC stored procedures and RESTful interfaces, which allow the SPL script to be called directly.

Parsing and translation of SQL

Thanks to SPL's open computing ability, we can easily use its SQL parsing function to split the SQL statement passed in from the frontend and obtain the date parameter in the WHERE clause, thereby identifying the data range involved in the query.
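As an illustration of what that parsing step yields, here is a hedged Python sketch that pulls the ORDERDATE value out of the WHERE clause with a regular expression. SPL's sqlparse does this far more robustly; the pattern below only handles the exact date('yyyy-MM-dd') form shown in the sample SQL:

```python
import re
from datetime import date
from typing import Optional

# Matches the ORDERDATE=date('yyyy-MM-dd') condition from the sample SQL
DATE_COND = re.compile(r"ORDERDATE\s*=\s*date\('(\d{4})-(\d{2})-(\d{2})'\)")

def extract_order_date(sql: str) -> Optional[date]:
    """Return the ORDERDATE constant from the WHERE clause, or None if absent."""
    m = DATE_COND.search(sql)
    if not m:
        return None
    y, mo, d = (int(g) for g in m.groups())
    return date(y, mo, d)
```

Applied to the sample SQL above, this returns date(2023, 11, 22), which is then compared with the current date to pick a route.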

Different types of databases use different SQL dialects, so implementing route calculation is inseparable from translating the SQL statement.

SPL provides the SQL translation function, which can translate the standard SQL submitted from the frontend to the SQL of the corresponding database.
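For example, Oracle has no `top` keyword, so a standard `select top 1000 ...` must become a row-limited Oracle query. SPL's sqltranslate handles such rewrites automatically; the following Python sketch shows just this one rule, using a ROWNUM wrapper as an illustrative (assumed) translation strategy:

```python
import re

# Matches a leading 'select top N ' in a standard SQL statement
TOP_RE = re.compile(r"^\s*select\s+top\s+(\d+)\s+", re.IGNORECASE)

def translate_top_for_oracle(sql: str) -> str:
    """Rewrite 'select top N ...' into an Oracle ROWNUM-limited query; pass through otherwise."""
    m = TOP_RE.match(sql)
    if not m:
        return sql
    n = m.group(1)
    inner = "select " + sql[m.end():]  # the query without the top clause
    return f"select * from ({inner}) where ROWNUM <= {n}"
```

This is only one of many dialect differences (date literals, string functions, pagination); the point is that the gateway, not the frontend, absorbs them.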

Practice process

Prepare the backend database

Execute orders.sql to import the sample data into the Oracle database.

Click to download: orders.sql

Prepare the front-end application

We use a JSP page in the Tomcat server to simulate a BI, reporting or other query system.

Prepare the tomcat server and deploy SPL’s JDBC driver in it. For detailed instructions, please visit http://doc.raqsoft.com/esproc/tutorial/.

Prepare the raw data

In practice, a scheduling system will regularly export the data of each day from the database to a text file and send the file to SPL. For the convenience of verification, we assume that the current date is November 24, 2023.

The text files are generated directly here:

A B C
1 =[date(2023,11,21),date(2023,11,22),date(2023,11,23)]
2 for A1 =file("orders"/string(A2,"yyyyMMdd")/".txt")
3 =movefile(B2) =0
4 for 30 =to(1000000).new(C3+~:ORDERID,rands("ABCDEFGHIJKLMNOPQRSTUVWXYZ",5):CUSTOMERNAME,rand(100000):EMPLOYEEID,A2:ORDERDATE,rand()*10000:AMOUNT)
5 >B2.export@at(C4)
6 >C3+=1000000

This code generates three text files:

orders20231121.txt
orders20231122.txt
orders20231123.txt

Each text file contains one day’s order data (30 million).
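For readers who want to reproduce the sample data without esProc, the following hedged Python sketch generates equivalent files. The field formats follow the SPL code above, scaled down to 1,000 rows per day instead of 30 million:

```python
import csv
import random
import string
from datetime import date

def generate_day(d: date, rows: int, start_id: int, path: str) -> None:
    """Write one day's random orders as a tab-separated text file with a header row."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["ORDERID", "CUSTOMERNAME", "EMPLOYEEID", "ORDERDATE", "AMOUNT"])
        for i in range(rows):
            w.writerow([
                start_id + i + 1,                                      # sequential order id
                "".join(random.choices(string.ascii_uppercase, k=5)),  # 5-letter customer name
                random.randrange(100000),                              # employee number
                d.isoformat(),                                         # order date
                round(random.random() * 10000, 2),                     # amount
            ])

for i, d in enumerate([date(2023, 11, 21), date(2023, 11, 22), date(2023, 11, 23)]):
    generate_day(d, 1000, i * 1000, f"orders{d.strftime('%Y%m%d')}.txt")
```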

ETL process

Text files take up a lot of disk space and are slow to read, resulting in poor computing performance, so we save the local files as SPL bin files.

In the preparation stage before the system goes online, we need to export the hot data of the last three days from the database to text files and then convert the text files to SPL bin files.

After the system goes online, a scheduling tool exports the order data of the day from the database to a text file every night, and then calls the SPL ETL code to convert the file to an SPL bin file.

The ETL code (etl.splx) is roughly as follows, with the input parameter being the order date (orderdate):

A
1 =file("orders"/string(orderdate,"yyyyMMdd")/".txt")
2 =A1.cursor@t()
3 =file("orders"/string(orderdate,"yyyyMMdd")/".btx")
4 =A3.export@b(A2)

The scheduling tool can call etl.splx from the command line. The Windows command line is roughly as follows:

C:\Progra~1\raqsoft\esProc\bin\esprocx etl.splx 2023-11-22

esprocx.exe is the command line execution program of SPL (in Linux, the execution program is esprocx.sh).

etl.splx is the SPL script file to be executed, and 2023-11-22 is a given parameter.
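A scheduler written in any language can invoke this command. Below is a hedged Python sketch of the nightly job; the executable path is taken from the Windows example above and is illustrative:

```python
import subprocess
from datetime import date, timedelta

ESPROCX = r"C:\Progra~1\raqsoft\esProc\bin\esprocx"  # esprocx.sh on Linux

def etl_command(day: date) -> list:
    """Build the command line that runs the SPL ETL script for one day's data."""
    return [ESPROCX, "etl.splx", day.isoformat()]

def run_etl(day: date) -> None:
    """Invoke esprocx for one day's data; raises CalledProcessError on a non-zero exit."""
    subprocess.run(etl_command(day), check=True)

# A nightly job would typically process the previous day:
# run_etl(date.today() - timedelta(days=1))
```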

Front-end web page code

We use jsp to simulate frontend application:

<%@ page language="java" import="java.util.*" pageEncoding="utf-8"%>
<%@ page import="java.sql.*" %>

<body>
<%
String driver = "com.esproc.jdbc.InternalDriver";
String url = "jdbc:esproc:local://";
try {
Class.forName(driver);
Connection conn = DriverManager.getConnection(url);
Statement statement = conn.createStatement();

String sql="select top 1000 ORDERID,CUSTOMERNAME,EMPLOYEEID,ORDERDATE,AMOUNT from orders where ORDERDATE=date('2023-11-22') and AMOUNT>100 and (CUSTOMERNAME='GDPAL' or EMPLOYEEID=47811)";

out.println("orderID"+"\t"+"customername"+"\t"+"employeeID"+"\t"+"orderdate"+"\t"+"orderamount"+"<br>");
ResultSet rs = statement.executeQuery(sql);
int f1;
String f2,f3,f4;
float f5;
while (rs.next()) {
f1 = rs.getInt("ORDERID");
f2 = rs.getString("CUSTOMERNAME");
f3 = rs.getString("EMPLOYEEID");
f4 = rs.getString("ORDERDATE");
f5 = rs.getFloat("AMOUNT");
out.println(f1+"\t"+f2+"\t"+f3+"\t"+f4+"\t"+f5+"\t"+"<br>");
}

rs.close();
statement.close();
conn.close();
} catch (ClassNotFoundException e) {
System.out.println("Sorry, can't find the driver!");
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
%>
</body>

Frontend configuration

Configure the raqsoftConfig.xml of the frontend application in Tomcat as follows:

<JDBC>
<load>Runtime,Server</load>
<gateway>gateway.splx</gateway>
</JDBC>

After this configuration, when esProc JDBC is called in Java, all SQL statements to be executed are sent to the gateway script gateway.splx for processing.

Note that gateway.splx should be placed under the main directory configured in raqsoftConfig.xml, see the tutorial for details.

Gateway script

The input parameters of the gateway script gateway.splx must include sql and args. The code is:

A B C
1 =today=date(2023,11,24) =sql.sqlparse@w().split(" ")
2 =B1.select@1(like(~,"ORDERDATE=date('????-??-??')"))
3 =date(mid(right(A2,14),3,10)) =today-A3
4 if B3<=3 =connect() =replace(sql,"from orders","from orders"/string(A3,"yyyyMMdd")/".btx")
5 =B4.cursor@x(C4) return B5
6 else =connect("orcl") =sql.sqltranslate("ORACLE")
7 =B6.cursor@x(C6) return B7

A1: here we assume today's date is November 24, 2023; in practice this should be modified to =today=now();

B1: use the SQL parsing function to obtain the where clause of sql;

A2, A3: obtain the date to be queried; B3: calculate the number of days between the current date and the date to be queried;

A4: determine the date difference and implement route calculation;

If the difference is within three days, execute B4 to C5, query and retrieve data from the local btx, and return the result.

If the difference is more than three days, execute B6 to C7, query and retrieve data from the backend database, and return the result. The SQL translation function used in C6 is to translate standard SQL statements in sql to Oracle SQL statements.
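The local branch's C4 step simply redirects the table name to that day's bin file. The same rewrite, sketched in Python for illustration (the file naming follows the ETL step above, and the replace rule mirrors C4):

```python
from datetime import date

def redirect_to_btx(sql: str, order_date: date) -> str:
    """Point 'from orders' at the local bin file for that day, as step C4 does."""
    return sql.replace("from orders", f"from orders{order_date.strftime('%Y%m%d')}.btx")
```

For a hot-data query, the gateway then runs the rewritten SQL against the local file instead of opening a backend connection.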

Practice effect

Coding in SPL to implement the above route calculation is very simple and debugging is very convenient. It only took one week from the start of programming to the end of testing. Moreover, SPL itself is very light, and the installation and configuration of SPL are simple and easy to learn. We spent only one day in installing SPL and putting the system online.

After the system went online, the hot data is stored locally in the frontend application, and 90% of query requests are no longer limited by the 5 connections provided by the backend database, allowing hundreds of users to access the system without pressure. Only 10% of requests are forwarded to the backend, greatly reducing the pressure on the backend database.

Postscript

In actual business scenarios, the route calculation rules are complex and diverse. For example:

The local database stores the hot data and the backend database stores the cold data;

There are two databases on the backend, one stores historical data and the other stores current data;

The BI, report or query application on the frontend needs to connect to different databases of multiple applications, and it may also involve mixed calculation across multiple databases;

The route calculation rules also change frequently, for example:

Users may decide that storing only the last three days' data in the local application is not the best scheme, and want to change it to: keep the most recent day's data permanently in memory, store the data of the last 2 to 4 days on local disk, store the data of the last month on a separate SPL server, and keep all other data in the backend database.

In these cases, using SPL to implement route calculation can quickly meet the needs of various complex rules, as well as the needs of frequent changes in rules.

In particular, SPL scripts support hot swapping: a script takes effect immediately after modification and deployment, without shutting down and restarting the server.