SPL Reporting: Lightweight, Low-Cost, Realtime Hot Data Reporting Solution

 

Real-time hot data reports are reports that can query all data, both hot and cold, in real time. In earlier times, when a business ran on a single TP database, generating such reports was not particularly difficult. But things change as data accumulates and has to be split off into a dedicated AP database. Once cold data and hot data are separated, building a real-time report over the full data set requires a mixed computation across databases. Moreover, the AP database and the TP database are of different types, which further complicates the reporting work.

An HTAP database combines TP and AP to restore the earlier single-database architecture and avoid cross-database mixed computation. However, migrating data to a new HTAP database is risky and costly, and the new database may not match the capabilities of the original TP and AP databases.

Using a logical data warehouse engine such as Trino, or a stream processing engine such as Flink with Kafka handling data transmission, can achieve real-time synchronization and mixed-source computation to some extent. The framework, however, is heavy and complicated, involving metadata management and other cumbersome tasks. Stream computing code is not concise, and real-time synchronization consumes a lot of resources. As a result, both the construction cost and the application cost are very high.

SPL solution


SPL connects to the TP source and the AP source through its native data source interfaces. Data from each source is processed separately and passed to SPL, where it is mixed and computed, and the result is delivered to the report in real time.

SPL has built-in mixed-source computing ability. It can directly access and retrieve data from almost any source, from AP databases to streaming mechanisms such as Kafka, to various data warehouses and big data platforms, and to files and cloud object storage, and perform mixed-source computation over them, as the example scripts below show.

Find more about mixed-source computing in SPL Reporting: Make Lightweight, Mixed-Source Reports without Logical Data Warehouse.

SPL is very lightweight and requires almost no change to the original framework: no database migration, no complicated architectural overhaul. To enjoy the convenience of a real-time query service, we just need to integrate SPL seamlessly into the reporting application.

Moreover, SPL retrieves hot data from the TP database on demand when the mixed-source computation runs, so this architecture does not require dumping data into the AP database with high real-time urgency. We can collate the data before synchronizing it to the AP database, which gives the AP database better computing performance.

Without further ado, let’s look at some examples where T+0 computations are implemented in SPL.

Examples

1. Business scenario

An ecommerce platform wants to monitor customer purchase behavior in real time and build a real-time product sales conversion rate report to help adjust its marketing strategy. The report should display the following indicators:

  • Each product’s views, purchase volume and conversion rate (purchase volume/views) per hour.

  • Real-time total sales for each product.

2. Target report

Generate a report that is updated in real time, with content as follows:

PID     Views   PurVOL   CnvRT(%)   TSales(¥)
P001    500     100      20         15,000
P002    300     50       16.67      7,500

Views and purchase volume: obtained by aggregating the real-time user clickstream data and the real-time order data respectively.

Total sales: obtained by combining the real-time order data with the historical order data.

Conversion rate: the ratio of purchase volume to views, computed in real time (for example, P001: 100 / 500 = 20%).

3. Source of data

TP database (order data): Stores the latest order data; supports transactional operations and real-time writes.

Kafka stream data: Records real-time customer action data (such as product view, add-to-cart and payment actions).

Hadoop storage (historical order data): Stores historical order data in the ORC format.

(1) TP database - orders table (the latest order data)

order_id   user_id   product_id   quantity   total_price   order_time
1001       2001      3001         2          400.00        2024-12-19 10:15:00
1002       2002      3002         1          150.00        2024-12-19 10:20:00
1003       2001      3003         3          600.00        2024-12-19 10:25:00

(2) Kafka stream data – customer action data (in the past hour)

product_id   views
P001         500
P002         300

(3) Hadoop storage - his_orders table (historical order data)

order_id   user_id   product_id   quantity   total_price   order_time
1001       2001      3001         2          400.00        2023-12-01 12:10:00
1002       2002      3002         1          150.00        2023-12-01 14:00:00
1003       2001      3003         3          600.00        2023-12-01 16:30:00

4. Write the script

Write the SPL script realTimeOrder.splx (here the historical data is read through Hive):


	A
1	=connect("mysqlDB")
2	=A1.query@x("SELECT product_id, SUM(quantity) AS total_sales, SUM(total_price) AS total_revenue FROM orders WHERE order_time > NOW() - INTERVAL 1 HOUR GROUP BY product_id")	/Real-time order data
3	=kafka_open("/kafka/order.properties", "topic_order")
4	=kafka_poll(A3).value	/Real-time customer action data
5	=kafka_close(A3)
6	=hive_open("hdfs://192.168.0.76:9000", "thrift://192.168.0.76:9083", "default", "hive")	/Connect to Hive
7	=hive_table@o(A6)	/Get the ORC tables and files
8	>hive_close(A6)
9	=A7.select(tableName=="his_orders")
10	=file(A9.location).hdfs_import@c()	/Historical order data
11	=A10.select(interval(order_time,now())<=730)	/Keep roughly the last two years
12	=A11.groups(product_id;sum(total_price):historical_revenue)
13	=join(A2:o,product_id;A4:v,product_id;A12:ho,product_id)	/Join the three parts of data
14	=A13.new(o.product_id:product_id,v.views:views,o.total_sales/v.views:conversion_rate,o.total_revenue+ho.historical_revenue:total_revenue)
15	return A14


Alternatively, the historical data can be read directly from HDFS, in which case the script looks like this:


	A
1	=connect("mysqlDB")	/TP source
2	=A1.query@x("SELECT product_id, SUM(quantity) AS total_sales, SUM(total_price) AS total_revenue FROM orders WHERE order_time > NOW() - INTERVAL 1 HOUR GROUP BY product_id")	/Real-time order data
3	=kafka_open("/kafka/order.properties", "topic_order")	/Kafka
4	=kafka_poll(A3).value	/Real-time customer action data
5	=kafka_close(A3)
6	=file("hdfs://localhost:9000/user/86186/orders.orc")	/Access HDFS directly
7	=A6.hdfs_import@c()	/Historical order data
8	=A7.select(interval(order_time,now())<=730)	/Keep roughly the last two years
9	=A8.groups(product_id;sum(total_price):historical_revenue)
10	=join(A2:o,product_id;A4:v,product_id;A9:ho,product_id)	/Join the three parts of data
11	=A10.new(o.product_id:product_id,v.views:views,o.total_sales/v.views:conversion_rate,o.total_revenue+ho.historical_revenue:total_revenue)
12	return A11


The script is clear and concise.

5. Integrate SPL script into the reporting application

It is simple to integrate the SPL script into the reporting application. Just import two SPL jar files – esproc-bin-xxxx.jar and icu4j-60.3.jar (generally located in [installation directory]\esProc\lib) – into the application, and then copy raqsoftConfig.xml (located in the same directory) to the application's classpath.

raqsoftConfig.xml is SPL’s core configuration file. Its name must not be changed.

Configure the required external libraries in raqsoftConfig.xml. For example, to use Kafka:

<importLibs>
    <lib>KafkaCli</lib>
</importLibs>

The MySQL source also needs to be configured. Add the connection information under the DB node:

<DB name="mysqlDB">
    <property name="url" value="jdbc:mysql://192.168.2.105:3306/raqdb?useCursorFetch=true"/>
    <property name="driver" value="com.mysql.cj.jdbc.Driver"/>
    <property name="type" value="10"/>
    <property name="user" value="root"/>
    <property name="password" value="root"/>
    …
</DB>

SPL encapsulates the standard JDBC interface, so the report can access SPL by configuring a JDBC data source. For example, the report data source can be configured as follows:

<Context>
    <Resource name="jdbc/esproc"
              auth="Container"
              type="javax.sql.DataSource"
              maxTotal="100"
              maxIdle="30"
              maxWaitMillis="10000"
              username=""
              password=""
              driverClassName="com.esproc.jdbc.InternalDriver"
              url="jdbc:esproc:local://"/>
</Context>

The standard JDBC call is the same as that for an ordinary database, so we will not elaborate further.

Then call the SPL script from the report's data set in the same way as calling a stored procedure, to get SPL's computing result.

Call the order data computation script realTimeOrder.splx:

call realTimeOrder()

The final result set is then output to the report.
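
For reference, the following is a minimal Java sketch of such a stored-procedure-style call through esProc's JDBC driver. The driver class and URL come from the configuration above; the class name and the printed columns are illustrative assumptions, and in a reporting tool the data set would simply issue the call realTimeOrder() statement against the configured data source (for example the jdbc/esproc resource defined earlier).

// Minimal sketch of calling the SPL script through esProc JDBC; names are illustrative.
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class RealTimeOrderCallDemo {
    public static void main(String[] args) throws Exception {
        // Load the esProc JDBC driver configured in the data source above
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
             // Call the SPL script in the same way as a stored procedure
             CallableStatement st = conn.prepareCall("call realTimeOrder()")) {
            st.execute();
            try (ResultSet rs = st.getResultSet()) {
                // The script returns product_id, views, conversion_rate and total_revenue
                while (rs.next()) {
                    System.out.println(rs.getString("product_id") + "\t"
                            + rs.getObject("views") + "\t"
                            + rs.getObject("conversion_rate") + "\t"
                            + rs.getObject("total_revenue"));
                }
            }
        }
    }
}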