Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times

 

Abstract

With just one day of work, the query on detail data was sped up by more than 2000 times. It's amazing! Read on to learn how.

Problem description

In the business system of insurance company C, querying the detail data of group insurance is very slow. After the policy number is entered, the query must return the information of all insurants covered by that group policy. For a relatively small policy with 10,000 insurants, it takes 7.5 minutes before the page appears; for a larger policy with 1 million insurants, the page still has not appeared after four hours.

Since the detail data of group insurance is fairly large, it is stored in two databases, and the data of one group policy may appear in both. The databases are Oracle, and the query SQL is 163 lines long in total, as shown in the figure below:

[Figure: the original 163-line SQL statement]

Analysis and solution

In the face of a performance problem, we need to carefully analyze the characteristics of the data and the calculation, identify the key performance bottlenecks, and then improve performance step by step by changing the data storage scheme and the calculation method.

Step 1: Confirm the requirement premise. The query on the detail data of group insurance is a function in the application system and must return the latest data. If the data were exported on a schedule by an ETL process, this requirement could not be satisfied. Therefore, we still have to fetch the data from the databases at query time and do the computation outside the database to optimize performance.

Step 2: Understand the characteristics of the business requirement. The detail data of group insurance is stored in two databases, db1 and db2, and each database has two tables, m1 and m2. A query must combine these four tables, which we collectively call the group insurance table.

Although the four tables differ, they all share the main fields to be queried: policy No., insurant No., number of endorsements, business No. 1, business No. 2, business mark, and personal information such as name, gender and age.

An "endorsement" is an adjustment to the insurance contract. The system saves the adjusted, latest insurance details in the group insurance table and does not modify the original details, so as to keep a record of changes. In the data, endorsements are reflected in the "number of endorsements" field. A query must return, for each insurant, the record with the maximum number of endorsements, that is, the latest data.
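To make this concrete, here is a tiny SPL sketch with made-up field names and values, showing that the latest record of each insurant is the one with the maximum number of endorsements:

A1	=create(memberNo,endorseNo,name)
A2	=A1.record([1001,0,"Ann", 1001,1,"Ann", 1001,2,"Ann", 1002,0,"Bob"])	// insurant 1001 has been endorsed twice
A3	=A2.group(memberNo).(~.maxp(endorseNo))	// for each insurant, keep the record with the maximum endorseNo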

There are also some invalid records in the detail data. To identify them, we check whether the string formed by concatenating business No. 1 and the business mark appears in the invalid set. The invalid set is built from the records of the same policy No. whose number of endorsements is less than 9 and whose business mark is D or U: for each such record, business No. 2 concatenated with the letter A is added to the set. If the concatenation of business No. 1 and the business mark of a record appears in the invalid set, that record is invalid and should be discarded.
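The rule is easier to see in code. Below is a minimal in-memory SPL sketch with hypothetical field names (bizNo1, bizNo2, bizFlag, endorseNo) and made-up values:

A1	=create(bizNo1,bizNo2,bizFlag,endorseNo)
A2	=A1.record(["N1","X9","D",2, "X9","Y1","A",1, "N2","Y2","N",1])
A3	=A2.select(endorseNo<9 && ["D","U"].contain(bizFlag)).(bizNo2+"A")	// the invalid set, here ["X9A"]
A4	=A2.select(!A3.contain(bizNo1+bizFlag))	// keep only valid records

Here the first record contributes "X9A" to the invalid set, so the second record, whose business No. 1 plus business mark is also "X9A", is discarded.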

Step 3: Disentangle and study the calculation process. Although the SQL code is relatively long, it can be divided into several parts. Part 1 queries the required data by policy No. from each of the four group insurance tables in the two databases, then unions the results. Part 2 is condition filtering, including the condition that discards invalid data and several other simple conditions. Part 3 uses the window function row_number() OVER (PARTITION BY insurant No. ORDER BY number of endorsements DESC) to find, for each insurant, the detail record with the maximum number of endorsements.

When Part 1 is executed alone, it returns tens of thousands to millions of rows, so returning the full result takes a long time. With a database JDBC cursor, however, part of the data comes back quickly; for example, thousands of records can be returned within a few seconds.

In Part 2, fetching the invalid set from the database by itself takes only a few seconds, and the result is small enough to be processed entirely in memory.

However, when Parts 1 and 2 are executed together, the query becomes very slow even if a cursor is used to return results in batches, and executing Part 3 at the same time makes it slower still.

Step 4: Design the presentation scheme. Based on how the SQL behaves when executed part by part, we decided to adopt the streaming big-report approach to achieve the speed-up. The principle is as follows:

[Figure: principle of the streaming big report]

Data fetching from the databases and data presentation are performed by two asynchronous threads. After the fetching thread sends the SQL, it fetches data continuously, performs the complex calculations, and caches the results locally. Meanwhile, the presentation thread takes data from the local cache for display. In this way, data that has already been fetched and cached is displayed quickly, and the user no longer has the sense of waiting.

Step 5: Design the optimization scheme for the calculation process. We implement the data fetching and computing in three stages.

Stage 1: As mentioned in Part 1 above, after adding sorting by insurant No. and then by number of endorsements in descending order, a database JDBC cursor can still fetch the first batches of data quickly. The sorting guarantees that the records of each insurant are adjacent when the data is fetched in batches, so that in stage 3 the latest record, the one with the maximum number of endorsements, can be found quickly.
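A minimal SPL sketch of stage 1, assuming datasource names db1/db2, a policy-number parameter p_no, and illustrative field names (the real SQL is of course far longer), might look like this. Assuming mergex merges by ascending values, the numeric endorsement count is negated to match the DESC order in the SQL:

A1	=connect("db1")
A2	=connect("db2")
A3	=A1.cursor("select memberNo,endorseNo,bizNo1,bizNo2,bizFlag,name from m1 where policyNo=? order by memberNo,endorseNo desc", p_no)
A4	=A1.cursor("select memberNo,endorseNo,bizNo1,bizNo2,bizFlag,name from m2 where policyNo=? order by memberNo,endorseNo desc", p_no)
A5	=A2.cursor("select memberNo,endorseNo,bizNo1,bizNo2,bizFlag,name from m1 where policyNo=? order by memberNo,endorseNo desc", p_no)
A6	=A2.cursor("select memberNo,endorseNo,bizNo1,bizNo2,bizFlag,name from m2 where policyNo=? order by memberNo,endorseNo desc", p_no)
A7	=[A3,A4,A5,A6].mergex(memberNo,-endorseNo)	// merge the four ordered cursors into one ordered cursor; nothing executes yet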

Stage 2: We fetch the invalid set of the policy into memory in one go, and use it to filter the data fetched batch by batch in stage 1, producing the valid details. Since there are few invalid records, few details are filtered out.
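Continuing the hypothetical sketch above, stage 2 could fetch the invalid set into memory and attach a filter to the merged cursor (for brevity, only one of the four tables is shown as the source of the invalid set):

A8	=A1.query("select distinct bizNo2 from m1 where policyNo=? and endorseNo<9 and bizFlag in ('D','U')", p_no).(bizNo2+"A")	// the in-memory invalid set
A9	=A7.select(!A8.contain(bizNo1+bizFlag))	// delayed filter: defined now, executed only when data is fetched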

Stage 3: Whether a record is the first one of an insurant is judged by whether the insurant No. has changed. Because the details are ordered by insurant No. and, within each insurant, by number of endorsements in descending order, the first record of each insurant is the one with the maximum number of endorsements, that is, the latest one. This plays the same role as the window function mentioned above.

Since each insurant has little data, usually a dozen or so records at most (corresponding to up to a dozen endorsements), and there are few invalid records, the batches fetched in stage 1 do not need to be large, and data can be returned to the front end batch by batch. This is the prerequisite for the rapid presentation of a streaming big report.
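Still continuing the sketch: the actual stage 3 uses a program cursor (see the next step); as a simplified stand-in, this sketch uses SPL's group@1, which keeps only the first record of each group of adjacent same-key rows, to express the same idea. A fetch then drives all three stages in a single traversal:

A10	=A9.group@1(memberNo)	// first record per insurant = the record with the maximum number of endorsements
A11	=A10.fetch(5000)	// only now are the databases actually read; one batch goes to the front end (batch size is an assumption)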

Step 6: Design the code implementation scheme. We use the delayed cursor mechanism to implement the above three stages. With a delayed cursor, the cursor calculations of the three stages are defined in turn; nothing is actually executed at definition time, and execution starts only after all three stages have been defined. The advantage of delayed calculation is that the three stages are completed in a single traversal without generating intermediate results that occupy memory, so the query results can be submitted to the front end batch by batch for presentation.

The cursor calculation in stage 3 is the most complicated and needs a program cursor to implement. The principle is as follows:

[Figure: principle of the program cursor]

When the program cursor is called, it calculates and returns results at the same time; only in this way can the requirements of the streaming big report be met.

Actual effect

After the optimization scheme is drawn up according to the calculation characteristics, an appropriate tool must be selected to implement the optimized calculation and presentation. It could of course be implemented directly in Java, but the amount of code would be large and the project period long, making it prone to hidden code errors and hard to debug and maintain. In contrast, the open-source esProc SPL language supports all the algorithms above, including mechanisms such as the delayed cursor, ordered segmented fetching with a cursor, and the program cursor, which allowed us to implement this custom calculation quickly with far less code. The front-end presentation requires a reporting tool that supports the streaming big-report mechanism; we chose RaqReport.

In this way, we completed the performance-optimization verification after only one day of programming, debugging and testing, and the query response became very fast. For the relatively small policy with 10,000 insurants, where it used to take 7.5 minutes for the page to appear, the first page of the report now presents in only 3 seconds; for the larger policy with 1 million insurants, where the page originally failed to appear even after 4 hours, the first page now presents in only 7 seconds, a response-speed improvement of more than 2000 times.

To reduce programming difficulty, SPL encapsulates a great deal, provides rich functions, and builds in the basic algorithms and storage mechanisms required by the above optimization scheme. As a result, the actual code is very short and development efficiency is very high. The data-fetching code for the above three stages takes only a few lines:

[Figure: the SPL grid code for the three-stage data fetching]

Postscript

To solve a performance-optimization problem, the most important thing is to design a high-performance computing scheme that effectively reduces the computational complexity and thereby increases the speed. We should therefore fully understand the characteristics of the calculation and the data on the one hand, and have an intimate knowledge of common high-performance algorithms on the other; only then can we design a reasonable optimization scheme suited to the situation at hand. The basic high-performance algorithms used here can be found at the course:  , where you can find the topics you are interested in.

Unfortunately, the current mainstream big-data systems in the industry are still based on relational databases. Whether it is a traditional MPP or HADOOP system, or some newer technology, they all try to make the programming interface closer to SQL. Being compatible with SQL does make it easier for users to get started. However, SQL, constrained by its theoretical limitations, cannot implement most high-performance algorithms, and can only look on helplessly as hardware resources are wasted, with no way to improve. Therefore, SQL should not be the future of big-data computing.

After the optimization scheme is obtained, we also need a good programming language to implement the algorithms efficiently. Although common high-level languages can implement most optimization algorithms, the code is long and development efficiency low, which seriously affects the maintainability of the program. Here the open-source SPL is a good choice: it has enough built-in basic algorithms, its code is very concise, and it provides a friendly visual debugging mechanism, which effectively improves development efficiency and reduces maintenance costs.

For the report presentation in this example, a reporting tool that supports streaming presentation is also required. RaqReport has unique advantages here: it can start presenting before all the data has been fetched, and it supports fast page turning both forward and backward without relying on the database's paging mechanism (which can cause data inconsistency). In this way, business users get a good experience.

