"It is almost universally accepted in the industry and has been successfully practiced for many y .."

mars RaqForum 25 No.
1 Reply • 366 View • 1 Years ago

When TP database is too bloated, then add an AP database?

It is almost universally accepted in the industry and has been successfully practiced for many years that when the TP database is too bloated, then add an AP database. Is there anything to discuss?

The use of AP database can indeed alleviate the computational pressure of the TP database, and the AP database usually has better computational performance and can provide users with a better experience, which is indeed a common practice in the database field. However, it does not mean that this road is incredibly smooth.

The first is the cost. The operation and maintenance of databases have never been simple. A TP database already has many tasks, and now we need to add another one, with different types and knowledge reserves. The AP database is often a distributed MPP, which is not only expensive but also highly complex to operate and maintain. There are many new AP databases that are indeed fast, but their functionality is insufficient, and they do not support complex SQL or stored procedures enough, which can result in high transformation workload. Regardless, soaring costs are inevitable.

Secondly, migrating computing tasks to the AP database can also be awkward. Usually, not all tasks should be migrated at one time, as this not only involves a huge workload but also poses significant risks. A more secure approach is to gradually migrate, first moving the most stressful tasks from the TP database. After all, the TP database is only running slowly and not unusable. After sharing some of the pressure, it can also run smoother.

But in this way, the selection of the AP database is a challenge. We know that a database is a closed system, where all data and tasks are managed uniformly by the database. With the continuous addition of data and tasks, the content of database management is also increasing, which may cause the previously smooth tasks to become less smooth. When there were few migrations in the initial stage, of course, it ran smoothly. However, if it was found that the AP database was difficult to support in the later stage, a lot of work had already accumulated and could only be awkwardly expanded.

The closeness of the database can also cause T+0 (real-time calculation of whole data) issues. A single database naturally enables the calculation of whole data, without the problem of T+0. After installing the AP database, the data is split into two databases, and mixed computing with multiple databases is basically an impossible task for the database itself, especially when the AP and TP databases are of different types. Either seek help from professional data synchronization tools, but it is difficult to achieve smooth and timely development, or develop at the application layer, driving up development costs.

Using an HTAP database is also not a good solution, as the AP capability of such databases is usually insufficient, and it also requires replacing the original TP database, which is too risky.

Then it seems that when TP is too bloated, then add an AP, it looks very beautiful, but there are actually many problems.

Is there another option?

esProc SPL is a more lightweight and flexible solution.

esProc SPL is not a database. But it has high computing performance and can serve as an AP database. Here is a test result of TPCH 100G (in seconds):

	esProc SPL	StarRocks	Clickhouse	Oracle
q1	9.7	14.0	15.4	114.3
q2	1.3	0.6	17.3	1.9
q3	8.8	9.8	memory overflow	165.8
q4	4.9	5.7	memory overflow	158.4
q5	8.9	13.1	memory overflow	174.5
q6	4.5	3.9	4.8	126.7
q7	10.5	12.4	memory overflow	181.5
q8	6.9	8.3	memory overflow	209.7
q9	16.8	21.3	memory overflow	256.0
q10	8.3	11.1	58.3	195.6
q11	0.9	1.3	6.7	8.7
q12	4.9	4.8	10.7	186.0
q13	12.1	21.3	134.1	33.3
q14	3.3	4.6	10.2	170.0
q15	4.7	7.1	11.2	161.8
q16	2.7	2.9	4.0	10.8
q17	5.3	4.2	44.6	156.5
q18	6.4	20.8	memory overflow	416.8
q19	5.8	6.0	>600	144.1
q20	5.2	5.2	31.2	171.0
q21	11.9	14.5	syntax error	360.7
q22	2.5	1.9	8.4	37.7
Total	146.3	194.8	-	3441.8

It can be seen that the performance of esProc is orders of magnitude higher than that of TP databases represented by Oracle, and can be comparable to or even surpass professional AP databases. For detailed test reports, please refer to: SPL computing performance test series: TPCH .

With the guarantee of performance benchmark, let’s take a look at how esProc SPL can avoid the embarrassment of using an AP database.

Strictly speaking, esProc SPL is not a database, but a professional computing engine. It directly uses files to store data. esProc does not have the concept of “(data)base”, which means there is no concept of loading into database and exporting out of database, as well as inside or outside a database. Data files can be stored in directories, moved freely, and redundant, and even placed on the cloud, making management very free and convenient. The operation and maintenance complexity of data storage is much lower than that of databases.

Unlike convention databases, esProc, as a pure Java program, can be seamlessly embedded into Java applications for execution. The entire computing engine is built into the JDBC driver package, unlike databases that require an independent server process. The esProc core package is less than 15MB, and with various third-party data source drivers, it is only hundreds of MBs and can even run smoothly on Android. esProc is like the code written by programmers themselves, packed into a large package, running within the same process, and enjoying the framework advantages brought by mature Java frameworks together, and the operation and maintenance of computing tasks are also very simple.

esProc SPL does not require a unified management system. Data can be stored dispersed in files without metadata concept or constraints between data. The operational logic can also be dispersed across various applications, and subsequent tasks and predecessor tasks do not need to run on the same set of hardware resources at all. Resource grabbing and correlation coupling will not occur, let alone predecessor tasks being affected by subsequent tasks. In this way, we can easily and effortlessly migrate computing business from TP database to esProc (in fact, the term ‘to esProc’ is not very appropriate, the correct statement is that esProc takes on the computing task. esProc does not have a domain of its own management like a database, it is just a computing engine, without the concepts of ‘in’ or ‘out’).

Of course, esProc also allows users to centralize all computing tasks and execute them on designated resources. Whether it is a decentralized, centralized, or hybrid mode depends entirely on the needs and conditions of users. So, esProc is a very flexible solution, without limiting users’ freedom.

T+0 is even simpler. esProc SPL is an open computing engine that doesn’t even count as a logical database. It does not require pre-defined metadata, and any accessible data source can be calculated at any time. Both small data (in-memory tables) and large data (cursors) can be supported. This makes it easy to achieve cross source computing, especially mixed computing between TP database and esProc files. T+0 is a natural benefit brought by esProc.

esProc no longer uses SQL, but has its own programming language SPL. SPL has all the computing power of SQL:

Orders.sort(Amount) // sorting
Orders.select(Amount*Quantity>3000 && like(Client,"*S*")) // filtering
Orders.groups(Client; sum(Amount)) // grouping
Orders.id(Client) // distict
join(Orders:o,SellerId ; Employees:e,EId) // join
...

However, in any case, it requires mastering a new programming language and also rewriting the calculation tasks written in SQL into SPL. Users may be concerned that this will result in significant migration workload.

At first glance, it looks like this, but after careful analysis, it will be clear that the development workload using esProc is not much higher than using an AP database, and long-term development efficiency will actually be significantly improved.

The SQL corresponding to simple calculation tasks is also relatively simple, which can be easily ported to the AP database, and sometimes even run intact. If you want to change it to SPL, it may be a bit troublesome, although it is not difficult, after all, the syntax form is completely different and you always need to rewrite it. However, the real hassles are those complex tasks (the so-called 20/80 principle, where 80% of the workload is spent on 20% of difficult tasks), these SQL statements are often hundreds or thousands of rows, and even use database specific functions and even stored procedures. However, there is often a significant difference between the AP database and the TP database in this regard. Even if it is still written in SQL, the workload is close to redevelopment. If encountering functions that are not supported by the AP database (such as stored procedures), it may need to rely on external third-party programs or UDFs to implement.

SPL has much stronger computational capabilities than SQL, and for complex operations, SPL is much more convenient than SQL. For example, in this task, to calculate the longest consecutive days for a stock to rise, SQL needs to be written in multiple nested, lengthy, and difficult to understand:

select max(ContinuousDays) from (
    select count(*) ContinuousDays from (
        select sum(UpDownTag) over (order by TradeDate) NoRisingDays from (
            select TradeDate,case when Price>lag(price) over ( order by TradeDate) then 0 else 1 end UpDownTag from Stock ))
    group by NoRisingDays )

The same calculation logic is much simpler to write in SPL:

Stock.sort(TradeDate).group@i(Price<Price[-1]).max(~.len())

If this type of task is frequently encountered, which development workload is greater?

The difficulty of debugging SQL and stored procedures is notorious worldwide, and there are no signs of improvement yet. SPL has a very convenient debugging interface, which can also increase development efficiency:

(The SPL code is written in a grid, which is very different from ordinary programming languages. Please refer to here: A programming language coding in a grid )

SPL can implement more high-performance algorithms, and its computational performance is also stronger than SQL. It often achieves the effect a cluster using a single machine, and in the vast majority of cases, it can limit distribution while saving a large amount of procurement and operation costs. Due to space limitations, this topic will not be elaborated further. You can refer to Here comes big data technology that rivals clusters on a single machine .

Finally, esProc SPL is open source and free. It is here https://github.com/SPLWare/esProc.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

TalkingData

mars • 366 View • 1 Years ago