Is distributed technology the panacea for big data processing?

Using distributed clusters to process big data is currently the mainstream approach: splitting a big task into multiple subtasks and distributing them to multiple nodes usually yields a significant performance improvement. Therefore, whenever processing capability falls short, adding nodes to expand capacity is the first remedy many practitioners think of. As a result, when we encounter a new big data processing technology, the first question we tend to ask is whether it supports distributed computing and how large a cluster it can support, which shows how deeply “distributed thinking” is already rooted in our minds.

So, is distributed technology really the panacea for big data processing?

Of course not. There is no cure-all in the world; every technology has its own applicable scenarios, and distributed technology is no exception.

Whether distributed technology can solve a processing capability problem depends on the characteristics of the computing task. If a task can be split easily, distributed technology works well. Conversely, if a task is relatively complex, so that the subtasks remain coupled to and reference one another after splitting, or even require a large amount of cross-node data transmission, distributed technology may not help much, and forcing it can make things worse.

Specifically, distributed technology suits most transactional (OLTP) scenarios: a single task involves only a small amount of data while the concurrency is high, so the workload splits easily (a small number of distributed transactions will be involved, but there are mature technologies to handle them).

Analytical (OLAP) tasks are a bit more complicated and need to be judged case by case. For simple query tasks, such as querying the detail data of an account (the health QR code queries recently popular in China are of this kind), distributed technology works well. Such queries are characterized by a huge total data volume but a small data volume per account; each query only needs to locate the data of one account and involves no complex computation. Since these queries closely resemble the OLTP scenario above, where a single task touches a small, mutually independent slice of data, the workload splits easily. In this case, adding nodes effectively improves query efficiency, and distributed technology can indeed be called a panacea.
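
To make the "easy to split" point concrete, here is a minimal, hedged Python sketch of per-account routing; the partitioning scheme, names, and data are illustrative assumptions, not the design of any specific database.

```python
# Illustrative only: rows are hash-partitioned by account id, so every query
# touches exactly one account and therefore exactly one node.

NODE_COUNT = 4
NODES = [dict() for _ in range(NODE_COUNT)]   # stand-in for each node's local storage

def node_for(account_id: str) -> int:
    # Route an account to a node by hashing its id.
    return hash(account_id) % NODE_COUNT

def insert(account_id: str, record: dict) -> None:
    NODES[node_for(account_id)].setdefault(account_id, []).append(record)

def query(account_id: str) -> list:
    # The whole query is answered by a single node with no cross-node traffic,
    # so adding nodes raises throughput almost linearly.
    return NODES[node_for(account_id)].get(account_id, [])

insert("A1001", {"date": "2023-05-01", "amount": 120.0})
print(query("A1001"))
```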

For more complex computing scenarios, however, distributed technology may not work well. Take the common join (association) operation as an example: in a distributed environment, a join triggers a shuffle, and data has to be exchanged between nodes. When the number of nodes is large, the network delay caused by this exchange offsets the performance gained by sharing the computation across machines; beyond a certain point, adding more nodes lowers performance rather than improving it. That is why many distributed databases publish an upper limit on node count, and the limit is usually quite low, a few dozen or at most around a hundred nodes.
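
The following toy Python sketch illustrates the shuffle behind a distributed join; the functions, data, and partitioning are made up for illustration and do not represent any particular engine.

```python
# Rows sharing a join key must first be moved to the same node; it is this
# cross-node data exchange that grows with cluster size and eats into the
# gain from adding nodes.

from collections import defaultdict

NODE_COUNT = 3

def shuffle_by_key(partitions, key):
    # Re-partition rows so that all rows sharing a join key land on one node.
    out = [[] for _ in range(NODE_COUNT)]
    for part in partitions:                               # every node sends rows...
        for row in part:
            out[hash(row[key]) % NODE_COUNT].append(row)  # ...to the key's owner node
    return out

def local_hash_join(left, right, key):
    # Ordinary hash join, now purely local because the shuffle co-located the keys.
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**a, **b} for a in left for b in index.get(a[key], [])]

# Orders and customers start out partitioned arbitrarily across 3 nodes.
orders = [[{"cid": 1, "amt": 10}], [{"cid": 2, "amt": 20}], [{"cid": 1, "amt": 5}]]
customers = [[{"cid": 2, "name": "B"}], [{"cid": 1, "name": "A"}], []]

shuffled_orders = shuffle_by_key(orders, "cid")        # network traffic on a real cluster
shuffled_customers = shuffle_by_key(customers, "cid")  # and more network traffic
result = [row for i in range(NODE_COUNT)
          for row in local_hash_join(shuffled_orders[i], shuffled_customers[i], "cid")]
print(result)
```

In-memory lists stand in for what would be network transfers on a real cluster; the two shuffle calls are exactly the traffic that grows as nodes are added.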

More importantly, the computing power of a distributed cluster does not scale linearly. A cluster consists of multiple physical machines that communicate over a network. When the memory of one node is insufficient, it has to access the memory of other nodes across the network. However, a network is only good at bulk transfers, whereas memory access typically involves small, random pieces of data, so cross-node memory access over the network degrades performance dramatically, usually by one or two orders of magnitude. Making up for that loss requires several times or even dozens of times more hardware. A cluster can therefore increase computing power, but not linearly, and with the node count limited, the cluster's contribution is limited too. Those hoping to use distributed technology to obtain "unlimited computing power" will find there is nothing it can do for them.
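
As a rough illustration of the gap, the snippet below uses typical order-of-magnitude latency figures; these are assumptions for the sake of the arithmetic, not measurements.

```python
# Ballpark arithmetic only: a local DRAM access costs on the order of 0.1
# microseconds, while a round trip over ordinary data-centre Ethernet costs
# on the order of 100 microseconds (both figures assumed, typical orders of
# magnitude).

local_access_us = 0.1          # ~100 ns per local memory access
network_roundtrip_us = 100.0   # rough TCP/Ethernet round-trip latency

print(f"one remote random access ~ {network_roundtrip_us / local_access_us:.0f}x a local one")
# Batching many values per request narrows the gap, but for small random
# accesses a slowdown of one to two orders of magnitude typically remains.
```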

In practice there are many even more complex computing scenarios. A typical example is the routine batch job, which processes business data into ready-to-use results during idle hours (such as at night). Such jobs are extremely complex: not only are the computing rules intricate, but the calculation must also proceed step by step in a fixed order. A batch job involves a large amount of historical data and may need to read and associate it repeatedly, which makes distributed technology hard to apply. Even when the calculation can be split, each step usually produces intermediate results that must be stored for the next step. These temporarily generated results cannot be distributed to other nodes in time (and cannot be prepared redundantly in advance), so the other nodes have to fetch them over the network before they can continue, which greatly reduces performance. For such complex scenarios it is difficult to benefit from distributed technology at all, quite apart from the node-count limitation discussed above; calling it a panacea is out of the question. As a result, such complex businesses are still run on a single large database, which is not only costly but also reaches its capacity ceiling quickly as the number of tasks grows.

So what should we do when computing performance in such scenarios hits a bottleneck that distributed technology cannot solve?

To answer that, we first need to analyze the characteristics of these jobs and the reasons they run slowly.

In fact, a closer look at these scenarios reveals that many "slow" jobs do not involve a very large amount of data. They usually operate on structured business data; although the total data volume is large, the data touched by a single task is not, typically only tens to hundreds of gigabytes and rarely as much as one terabyte. Take a typical bank batch job as an example: suppose the bank has 20 million accounts, each account has one aggregate record per month, and the job uses the past year of history. The total is then fewer than 300 million rows. Assuming each record holds about 100 statistical values and each row is roughly 1 KB, the total physical size is under 300 GB, which simple compression techniques can bring below 100 GB. Data of this size can usually be processed on a single machine.
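
A quick back-of-envelope check of this sizing, using the figures assumed in the paragraph above (20 million accounts, one record per account per month, one year of history, ~1 KB per row, roughly 3:1 compression):

```python
accounts = 20_000_000
rows = accounts * 12                    # 240 million rows, i.e. under 300 million
raw_gb = rows * 1024 / 1024**3          # ~229 GB raw, i.e. under 300 GB
compressed_gb = raw_gb / 3              # ~76 GB, i.e. below 100 GB

print(f"{rows:,} rows, ~{raw_gb:.0f} GB raw, ~{compressed_gb:.0f} GB compressed")
```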

If the data volume is not that large, why do these jobs run so slowly, with a batch job frequently taking hours to finish?

There are two main reasons.

First, the calculation is complex. Even though the data volume is modest, associations occur repeatedly during the calculation, and as the amount of computation grows, performance inevitably drops. Take an extreme example: a celestial-body clustering scenario at NAOC is exactly a case of small data but high computational complexity, and hence poor performance. In this scenario there are 11 photos (datasets) in total, each containing 5 million celestial bodies, for a total data volume of no more than 10 GB. The task is to cluster together celestial bodies whose positions are close to each other (in terms of astronomical distance) and recalculate their attributes. Although the data volume is small, the amount of calculation is enormous and proportional to the square of the data size: the astronomical distance has to be computed roughly 5 million * 5 million * 10 = 250 trillion times, a genuinely astronomical number. When a certain distributed database was used for this task, running on 100 CPUs, it still took 3.8 hours to process just 500 thousand celestial bodies, and processing all 5 million was estimated to take 15 days (while the users hoped the task could be finished within a few hours).
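
The brute-force workload can be counted directly from the figures quoted above (taken as assumptions here: 5 million objects per photo, 10 photo pairings, every object compared against every object):

```python
objects_per_photo = 5_000_000
photo_pairings = 10
distance_calculations = objects_per_photo ** 2 * photo_pairings
print(f"{distance_calculations:,}")   # 250,000,000,000,000 -- 250 trillion
```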

Second, the computing power of a single machine is not fully utilized; in other words, hardware resource utilization is low, and this is closely tied to the data processing technology in use. Today SQL (the database) is still the main tool for processing structured data, and this is a major reason why the computing power of a single machine cannot be brought into full play. Because SQL lacks some key data types (such as a record type) and basic operations (such as ordered computation), many high-performance algorithms simply cannot be expressed, so slower algorithms have to be used instead. Databases do apply optimizations in practice, but these only cover simple cases; once the computing task becomes complicated, the database optimizer gives up, so the optimizations do not address the root of the problem. This explains why, in the NAOC example above, the SQL-coded cluster still could not meet the requirement even with 100 CPUs.

In fact, if the data processing technology allows appropriate algorithms to be chosen for the actual computing scenario, the computational complexity can be reduced and performance improved. The key is that it is not enough to think of a high-performance algorithm; you also have to be able to write it down in code. Unfortunately, that is very hard to do in SQL: even when a high-performance algorithm is devised, it often cannot be implemented, and you are left at a dead end.

Beyond SQL, emerging computing technologies such as Spark also suffer from poor performance (low resource utilization). Spark's RDDs are immutable, so each calculation step produces a new RDD copy, which occupies and wastes a large amount of memory and CPU. With such low resource utilization, a large cluster with plenty of memory is needed to meet performance requirements.
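
A minimal PySpark sketch of the immutable style this refers to (it assumes a local Spark installation; the data and app name are made up): each transformation returns a brand-new RDD rather than modifying data in place.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-immutability")
rdd1 = sc.parallelize(range(1_000_000))
rdd2 = rdd1.map(lambda x: x * 2)          # a new RDD; rdd1 itself is never changed
rdd3 = rdd2.filter(lambda x: x % 3 == 0)  # yet another new RDD
print(rdd3.count())                       # 333334
sc.stop()
```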

Therefore, if you want to improve computing efficiency by making full use of hardware resources, you need to look to other technologies. SPL is exactly such a technology.

Like SQL, SPL is a computing engine designed for structured data. The difference is that SPL adopts a more open computing model and provides many mechanisms for implementing high-performance algorithms (along with corresponding high-performance storage schemes). With these, the goal described above can be achieved, and achieved easily: hardware resources are put to full use, jobs that used to require a cluster can run without one, and a small cluster can replace a large one.

Let's return to the NAOC example. SPL would be no faster if it still performed 250 trillion comparisons, but the algorithm itself can be optimized. Specifically, we can exploit the monotonicity and orderedness of the distance between celestial bodies to do a coarse screening first, and then use binary search to quickly narrow the possible matches down to a small range. In this way most of the comparisons are avoided, and the computational complexity drops to about 1/500 of the original. Finally, parallel computing is applied to improve efficiency further.
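
The idea can be sketched in a few lines of Python; this is a hedged illustration of the approach just described, not the actual SPL code, and the names, 2-D coordinates, and distance function are assumptions for the example.

```python
# Sort one catalogue by a single coordinate once, then for each object in the
# other catalogue use binary search to pull out only the neighbours whose
# coordinate falls within the matching radius, and compute exact distances for
# that small window alone.

from bisect import bisect_left, bisect_right
from math import hypot

def match(reference, targets, radius):
    ref = sorted(reference)                    # order the reference catalogue by x once
    xs = [p[0] for p in ref]
    pairs = []
    for tx, ty in targets:
        lo = bisect_left(xs, tx - radius)      # coarse screening on x via binary search
        hi = bisect_right(xs, tx + radius)
        for rx, ry in ref[lo:hi]:              # exact distance only inside the window
            if hypot(rx - tx, ry - ty) <= radius:
                pairs.append(((rx, ry), (tx, ty)))
    return pairs

# Toy data: a handful of points instead of 5 million celestial bodies per photo.
reference = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
targets = [(1.05, 0.98), (4.9, 5.2)]
print(match(reference, targets, radius=0.5))
```

Each object is now compared only against a thin slice of the sorted catalogue instead of all 5 million counterparts, which is where the reduction in work comes from, and the outer loop parallelizes naturally across CPUs.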

As mentioned earlier, thinking of a high-performance algorithm is not enough; it also has to be implemented. So how much code does this optimized algorithm take in SPL? Only about 50 lines in total. And the effect? The full 5 million celestial bodies can be processed in 4 hours on 16 CPUs, thousands of times faster overall than the SQL solution.

Attentive readers will have noticed that in this case SPL met the user's performance requirement with very modest hardware (a single machine), making distributed technology unnecessary. This is exactly what we advocate: squeeze the most out of a single machine first, and turn to distributed technology only when a single machine's hardware is genuinely insufficient.

There are many similar cases where SPL delivers cluster-level results on a single machine. For example, in a commercial bank's high-concurrency mobile-banking account query scenario, SPL achieved with one server the query efficiency that previously required a 6-node ElasticSearch cluster, and solved the real-time association problem at the same time (for details, see: Open-source SPL turns pre-association of query on bank mobile account into real-time association). In another case, an e-commerce funnel computation took only 29 seconds in SPL with 8 CPUs, whereas the same task on Snowflake's Medium-size warehouse (a 4-node cluster) had not returned a result after 3 minutes (for details, see: SQL Performance Enhancement: Conversion Funnel Analysis).

Beyond replacing a cluster with a single machine, SPL can also speed up, by many times, tasks that run slowly on a single database, simply by making full use of one machine's performance, so there is no need to resort to distributed technology at all. For example, a bank's corporate loan business calculation that took 1.5 hours on AIX+DB2 took less than 10 minutes in SPL, a roughly 10-fold improvement (for details, see: Open-source SPL speeds up batch operating of bank loan agreements by 10+ times). In another case, the car insurance policy batch job of a large insurance company went from 2 hours to 17 minutes after SPL replaced the database, a 7-fold speedup (for details, see: Open-source SPL optimizes batch operating of insurance company from 2 hours to 17 minutes). There are many more cases like these; readers interested in the cases and the principles behind them can visit: How the performance improvement by orders of magnitude happened.

Of course, this article is not meant to oppose distributed technology, but to warn against using it incorrectly. Exploiting a single machine to the full first, and then turning to distributed technology when its hardware truly runs out, is the right way to approach big data computing.

Moreover, SPL does provide complete distributed computing capabilities, together with load balancing and fault tolerance mechanisms, and it offers different fault-tolerance schemes (such as redundancy-style and spare-wheel-style fault tolerance) for different requirements and scenarios. It is worth noting that SPL clusters are aimed at small-to-medium scale, preferably no more than 32 nodes. Because SPL's computing performance is extremely high and hardware resources are used effectively, clusters of this size are sufficient in practice, and many scenarios can be handled on a single machine or at most a few machines. Of course, for the minority of scenarios that genuinely require a larger cluster, other technologies should be chosen.

In conclusion, the precondition for using distributed technology is that the computing task can be split easily, and, more importantly, the performance of a single machine should be fully exploited before turning to distributed technology.