Comparison of data processing languages on the JVM: Kotlin, Scala and SPL
The main open-source data processing languages on the JVM are Kotlin, Scala and SPL. This article compares them in many aspects to find out which offers the highest development efficiency. The scope is limited to common data processing and business logic in practice, with emphasis on structured data; it does not cover big data or high performance, nor special scenarios such as message streams and scientific computing.
Basic features
Applicability
Kotlin was originally designed as a more efficient Java and is applicable to any scenario where Java is used. Besides common MIS applications, Kotlin can also be used in WebServer and Android projects as well as game development, so it has relatively good universality. Scala was originally designed as a general-purpose language integrating modern programming paradigms; in practice it is mainly used for back-end big data processing and rarely for other project types, so its universality is not as good as Kotlin's. SPL was designed as a specialized data processing language, and its actual use matches that purpose: it suits both front-end and back-end data, and both big and small data. Because its application scenarios are relatively concentrated, SPL's universality is also not as good as Kotlin's.
Programming paradigm
Kotlin focuses on object-oriented programming and also supports functional programming. Scala supports both paradigms: it is more thorough than Kotlin at object-oriented programming and more convenient than Kotlin at functional programming. SPL cannot really be regarded as object-oriented; although it has an object concept, it lacks related content such as inheritance and overloading. For functional programming, SPL is likewise more convenient than Kotlin.
Operation mode
Kotlin and Scala are both compiled languages, while SPL is interpreted. An interpreted language is more flexible, but for the same code its performance is slightly worse. However, since SPL has rich and efficient library functions, its overall performance is not inferior, and it often has the advantage when processing big data.
External library
Kotlin can use all of Java's libraries, but it lacks professional data processing libraries. Scala can also use all of Java's libraries, and has a built-in professional big data processing library (Spark). SPL has built-in professional data processing functions and provides a large number of basic operations with lower time complexity. SPL usually needs no external Java libraries; in special cases they can be called from user-defined functions.
IDE and debugging
All three languages have graphical IDEs and complete debugging functions. SPL's IDE is specially designed for data processing: structured data objects are presented as tables, which makes them easy to observe. The IDEs of Kotlin and Scala are general-purpose and not optimized for data processing, so observing structured data objects is inconvenient.
Learning difficulty
Kotlin is slightly more difficult to learn than Java, and those proficient in Java can pick it up easily. Since Scala's objective is to surpass Java, it is far more difficult to learn than Java. SPL's objective is to simplify the coding of Java and even SQL; many concepts are deliberately simplified, so learning SPL is very easy.
Amount of code
Kotlin's original intention is to improve Java's development efficiency. According to official data, Kotlin's overall code amount is only 20% of Java's, yet in practice the reduction is not that large, probably because its data processing libraries are not professional enough. Scala has plenty of syntactic sugar and a more professional big data processing library, so its code amount is much lower than Kotlin's. SPL is used only for data processing and is the most professional of the three; moreover, as an interpreted language it has strong expressive power, so the code amount for the same task is far lower than the other two (comparison examples follow later). This also indicates that SPL is less difficult to learn.
Syntax
Data types
Atomic data types: all three languages support types such as Short, Int, Long, Float, Double and Boolean.
Date/time data types: Kotlin lacks an easy-to-use date/time type and generally uses Java's. Both Scala and SPL have professional and convenient date/time types.
Characteristic data types: Kotlin supports the non-numeric character type Char and the nullable type Any?. Scala supports tuples (fixed-length generic sets) and has built-in BigDecimal. SPL supports a high-performance multi-layer sequence-number key and has built-in BigDecimal.
Set data type: both Kotlin and Scala support Set, List and Map. SPL supports the sequence (ordered generic set, similar to List).
Structured data types: Kotlin has the record set List&lt;EntityBean&gt; but lacks metadata, so it is not professional enough. Scala has professional structured data types, including Row, RDD, DataSet and DataFrame (the example used in this article). SPL has professional structured data types, including the record, the table sequence (the example used in this article), the in-memory compressed table, and the external-storage lazy cursor.
Scala has a unique implicit conversion capability. In theory it can convert between any data types (including parameters, variables, functions and classes), and can easily change or enhance existing functionality.
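To illustrate the date/time point above: Kotlin generally borrows java.time from the JDK rather than offering a type of its own. A minimal sketch (the date value is hypothetical):

```kotlin
import java.time.LocalDate
import java.time.format.DateTimeFormatter

fun main() {
    // Kotlin has no date type of its own; it relies on the JDK's java.time
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
    val d = LocalDate.parse("2023-01-31", fmt)
    println(d.plusMonths(1))   // date arithmetic via the Java API: 2023-02-28
}
```

This works, but every date operation is a call into the Java API rather than native language support.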
Flow processing
All three languages support basic sequential execution, branch judgment and loops, and in theory any complex flow processing can be performed with each of them, so this aspect will not be discussed much here. The following focuses on comparing the convenience of loop structures over set data, taking "calculate the LRR (link relative ratio, i.e., month-over-month growth)" as an example. Kotlin code:
mData.forEachIndexed { index, it ->
    if (index > 0) it.Mom = it.Amount / mData[index - 1].Amount - 1
}
Kotlin's forEachIndexed function provides the sequence-number variable and the member variable, making set loops convenient. Since it supports taking records by index, cross-row calculation is easy. The disadvantage is that array out-of-bounds must be handled separately.
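For reference, a self-contained, runnable sketch of the Kotlin approach above; the Order class and the sample amounts are hypothetical:

```kotlin
// Hypothetical record type; Mom (month-over-month ratio) starts as null
data class Order(val Amount: Double, var Mom: Double? = null)

fun main() {
    val mData = listOf(Order(100.0), Order(150.0), Order(75.0))
    mData.forEachIndexed { index, it ->
        // index > 0 guards against going out of bounds on the first row
        if (index > 0) it.Mom = it.Amount / mData[index - 1].Amount - 1
    }
    println(mData.map { it.Mom })   // [null, 0.5, -0.5]
}
```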
Scala code:
val w = Window.orderBy(mData("SellerId"))
mData.withColumn("Mom", mData("Amount")/lag(mData("Amount"),1).over(w)-1)
Scala needs no out-of-bounds handling for cross-row calculation, which is more convenient than Kotlin. However, Scala's structured data objects do not support taking records by index; only the lag function can be used to shift the whole row, which is inconvenient for structured data. The lag function cannot be used inside the highly universal forEach; a single-purpose loop function such as withColumn must be used instead. And to keep the functional-programming and SQL styles consistent, lag must also work with a window function (Python's row-shifting function does not require this), so overall the code looks more complicated than Kotlin's.
SPL code:
mData.(Mom=Amount/Amount[-1]-1)
SPL makes many optimizations to flow control over structured data objects. The most general and commonly used loop function, the equivalent of forEach, can be expressed directly with parentheses, simplifying it to the extreme. SPL also has row shifting, but uses the more intuitive "[relative position]" syntax, which is more powerful than Kotlin's absolute positioning and more convenient than Scala's lag for cross-row calculation. Beyond the code above, SPL offers more flow-processing functions for structured data, such as taking a batch of records rather than one record per loop round, or starting a new round when the value of a certain field changes.
Lambda expression
A Lambda expression is a simple implementation of an anonymous function, aiming to simplify function definition, especially for the diverse set-computing functions. Kotlin supports Lambda expressions, but because it is a compiled language, it is difficult to conveniently determine whether a parameter expression is a value parameter or a function parameter. Complex interface rules have to be designed to distinguish them, including so-called dedicated interfaces for higher-order functions, which makes Kotlin's Lambda expressions hard to write and unprofessional for data processing. A few examples:
"abcd".substring( 1,2) //value parameter
"abcd".sumBy{ it.toInt()} //function parameter
mData.forEachIndexed{ index,it-> if(index>0) it.Mom=…} //function of function parameter has multiple parameters
The lack of professionalism of Kotlin's Lambda expressions is also reflected in the fact that the variable name (it) of the structured data object must precede a field, unlike SQL, where the table name can be omitted when computing on a single table.
Likewise, as a compiled language, Scala's Lambda expressions are not much different from Kotlin's: complex interface rules also have to be designed, and coding is also difficult, so no example is given here. When calculating the LRR, the variable name of the structured data object must also precede the field, or the col function must be used, as in mData("Amount") or col("Amount"). Syntactic sugar such as $"Amount" or 'Amount can compensate, but many functions do not support it, and insisting on it makes the style inconsistent.
Unlike Kotlin's and Scala's, SPL's Lambda expressions are easy to use and more professional, which relates to SPL being an interpreted language. An interpreted language can easily infer whether a parameter is a value parameter or a function parameter; there are no complex special interfaces for higher-order functions, and all function interfaces are equally simple. A few examples:
mid("abcd",2,1) //value parameter
Orders.sum(Amount*Amount) //function parameter
mData.(Mom=Amount/Amount[-1]-1) //function of function parameter has multiple parameters
SPL can use the field name directly, and the variable name of structured data object is not needed, for example:
Orders.select(Amount>1000 && Amount<=3000 && like(Client,"*S*"))
Since most of SPL's loop functions have the default member variable ~ and sequence-number variable #, code writing becomes significantly more convenient, which is especially suitable for structured data calculation. For example, to take out the records at even positions:
Students.select(# % 2==0)
Find out the top 3 in each group:
Orders.group(SellerId;~.top(3;Amount))
SPL function options and cascaded parameter
It is worth mentioning that, to further improve development efficiency, SPL also provides unique function syntax.
When there are many functions with similar functionality, most programming languages can only distinguish them by different names or parameters, which is inconvenient. SPL provides the very unique function option, which lets functions with similar functionality share one name and be distinguished only by the option. For example, the basic function of select is filtering; to filter out only the first record that meets the condition, use the option @1:
T.select@1(Amount>1000)
When using the binary search to quickly filter the sorted data, you can use the option @b:
T.select@b(Amount>1000)
Function options can also be combined, for example:
Orders.select@1b(Amount>1000)
Some functions have very complex parameters, possibly divided into multiple layers. Conventional programming languages have no special syntax for this and can only build multi-layer structured data objects and pass them in, which is very troublesome. SQL uses keywords to separate parameters into groups, which is more intuitive and simpler, but it uses many keywords and makes statement structure inconsistent. SPL creatively invents the cascaded parameter, which simplifies the expression of complex parameters: parameters are divided into three layers, from high to low, by semicolons, commas and colons:
join(Orders:o,SellerId ; Employees:e,EId)
Data source
Types of data sources
In principle Kotlin supports all of Java's data sources, but the code is cumbersome, data type conversion is troublesome, and the result is unstable, because Kotlin has no built-in data source access interfaces, let alone optimization for structured data processing (except the JDBC interface). In this sense, Kotlin does not directly support any data source and can only use Java's third-party libraries; fortunately, there are plenty of them.
Scala supports many data source types, with six built-in interfaces optimized for structured data processing: JDBC, CSV, TXT, JSON, Parquet columnar storage, and ORC columnar storage. Other data sources can be accessed through third-party libraries developed by the community; Scala provides a data source interface specification requiring third-party libraries to output structured data objects. Common third-party interfaces include XML, Cassandra, HBase and MongoDB.
SPL has built in the largest number of data source interfaces, and optimized for the processing of structured data, including:
JDBC (i.e., all RDBs)
CSV, TXT, JSON, XML, Excel
HBase, HDFS, Hive, Spark
Salesforce, Alicloud
Restful, WebService, Webcrawl
Elasticsearch, MongoDB, Kafka, R2dbc, FTP
Cassandra, DynamoDB, influxDB, Redis, SAP
These data sources can be accessed directly, so it is very convenient. For other data sources that are not listed above, SPL provides interface specification, and thus, as long as such data sources are exported as the structured data objects of SPL according to the specification, subsequent calculations can be performed.
Comparison of codes
Let's take the standard CSV file as an example to compare the parsing codes of the three languages. Kotlin:
val file = File("D:\\data\\Orders.txt")
data class Order(var OrderID: Int, var Client: String, var SellerId: Int, var Amount: Double, var OrderDate: Date)
var sdf = SimpleDateFormat("yyyy-MM-dd")
var Orders=file.readLines().drop(1).map{
var l=it.split("\t")
var r=Order(l[0].toInt(),l[1],l[2].toInt(),l[3].toDouble(),sdf.parse(l[4]))
r
}
var result=Orders.filter{
    it.Amount>= 1000 && it.Amount < 3000}
Kotlin is not very professional and usually has to hard-code CSV reading: the data structure must be defined in advance, and data types must be parsed manually inside the loop function, so the overall code is quite cumbersome. Alternatively, a library like OpenCSV can read the CSV file; then data types need not be parsed in code but must be defined in a configuration file, and the implementation process is not necessarily simpler.
Scala is very professional. Since it has a built-in interface for parsing CSV, its code is much shorter than Kotlin's:
val spark = SparkSession.builder().master("local").getOrCreate()
val Orders = spark.read.option("header", "true").option("sep","\t").option("inferSchema", "true").csv("D:/data/orders.csv").withColumn("OrderDate", col("OrderDate").cast(DateType))
Orders.filter("Amount>1000 and Amount<=3000")
Scala is a bit cumbersome when parsing data types, but has no obvious shortcomings in other aspects.
Compared with Scala, SPL is more professional, and parsing and calculation takes only one line of code:
T("D:/data/orders.csv").select(Amount>1000 && Amount<=3000)
Cross-source computing
JVM data processing languages are open enough and fully capable of performing association, merge and set operations across different data sources. However, differences in data processing professionalism lead to great differences in convenience.
Kotlin is not professional enough: it lacks not only built-in data source interfaces but also cross-source calculation functions, so everything must be hard-coded. Assuming the employee table and orders table have been fetched from different data sources, to associate them, Kotlin code:
data class OrderNew(var OrderID: Int, var Client: String, var SellerId: Employee, var Amount: Double, var OrderDate: Date)
val result = Orders.map { o ->
        var emp = Employees.firstOrNull { it.EId == o.SellerId }
        emp?.let { OrderNew(o.OrderID, o.Client, emp, o.Amount, o.OrderDate) }
    }
    .filter { o -> o != null }
This code makes Kotlin's shortcomings easy to see: once the code gets long, Lambda expressions become hard to read, less understandable than ordinary code; and the post-association data structure must be defined in advance, which reduces flexibility and interrupts the fluency of problem solving.
Scala is more professional than Kotlin. In Scala, not only a variety of data source interfaces are built in, but the cross-source computing functions are provided. For the same calculation task, Scala code is much simpler:
val join=Orders.join(Employees,Orders("SellerId")===Employees("EId"),"Inner")
From this code, we can see that Scala has objects and functions dedicated to structured data calculation that work well with Lambda expressions. The code is easier to understand, and there is no need to define a data structure in advance.
Compared with Scala, SPL is more professional. Specifically, SPL has more professional structured data objects, more convenient cross-source calculation functions, and shorter code:
join(Orders:o,SellerId;Employees:e,EId)
Language's own storage format
Intermediate data that is used repeatedly is usually saved as a local file in some format to improve fetching performance. Kotlin supports many file formats and can in theory store and re-compute intermediate data, but since it is not professional in data processing and even basic read/write operations take a fairly large amount of code, Kotlin effectively has no storage format of its own.
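Lacking a data-oriented format of its own, Kotlin can fall back on generic Java serialization to cache intermediate results. A minimal sketch; the Row class and file name are hypothetical, and this stores opaque objects rather than a queryable, columnar format:

```kotlin
import java.io.*

// Hypothetical intermediate result type; must be Serializable for this approach
data class Row(val dept: String, val amount: Double) : Serializable

fun main() {
    val rows = listOf(Row("Sales", 100.0), Row("HR", 50.0))
    // Write the whole list out as one serialized object
    ObjectOutputStream(FileOutputStream("cache.bin")).use { it.writeObject(ArrayList(rows)) }
    // Read it back; the cast is unchecked because serialization erases the element type
    @Suppress("UNCHECKED_CAST")
    val back = ObjectInputStream(FileInputStream("cache.bin")).use { it.readObject() } as List<Row>
    println(back == rows)   // true
}
```

There is no columnar layout, no compression tuned for data, and no index; any computation requires loading everything back into memory first.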
Scala supports a variety of storage formats, among which the Parquet file is commonly used and easy to use. Parquet is an open-source format that supports columnar storage and large data volumes, and the intermediate calculation result (a DataFrame) converts to and from Parquet files easily. Unfortunately, Parquet's indexing is not yet mature.
val df = spark.read.parquet("input.parquet")
val result=df.groupBy(data("Dept"),data("Gender")).agg(sum("Amount"),count("*"))
result.write.parquet("output.parquet")
SPL supports two private binary storage formats, btx and ctx. btx is a simple row-based format, while ctx supports row-based storage, columnar storage and indexes, can store large amounts of data, and supports high-performance computing. Intermediate results (table sequences/cursors) convert to and from these two formats easily.
    A
1   =file("input.ctx").open()
2   =A1.cursor(Dept,Gender,Amount).groups(Dept,Gender;sum(Amount):amt,count(1):cnt)
3   =file("output.ctx").create(#Dept,#Gender,amt,cnt).append(A2.cursor())
Structured data calculation
Structured data object
The core of data processing is calculation, especially the structured data calculation. The specialization level of structured data objects is decisive for the convenience of data processing.
Kotlin does not have a professional structured data object; List&lt;EntityBean&gt; is commonly used for structured data calculation, where EntityBean can use a data class to simplify the definition.
List is an ordered set (members repeatable), and Kotlin supports any function involving member sequence numbers well. For example, to access members by sequence number:
Orders[3] //take records by index, starting from 0
Orders.take(3) //first 3 records
Orders.slice(listOf(1,3,5)+IntRange(7,10)) //records with index 1, 3, 5, 7-10
It also supports taking members by sequence number from the tail:
Orders.reversed().slice(listOf(0,2,4)) //the 1st, 3rd and 5th records from last
Orders.take(1)+Orders.takeLast(1) //the 1st and last records
Order-related calculations are relatively difficult in general, but since Kotlin computes over sets in order, such calculations are more convenient in Kotlin. As a set data type, List is also good at adding, deleting and modifying members, intersection, difference and union, and splitting. However, List is not a professional structured data object; once functions related to field structure are involved, Kotlin struggles. For example, to take two fields of Orders to form a new structured data object:
data class CliAmt(var Client: String, var Amount: Double)
var CliAmts=Orders.map{it.let{CliAmt(it.Client,it.Amount) }}
This operation is very common, equivalent to the simple SQL statement select Client,Amount from Orders, yet Kotlin's code is cumbersome: the new structure must be defined in advance, and field assignment must be hard-coded. If even simple field extraction is this cumbersome, more advanced functions are harder still, such as taking records by field sequence number, taking by parameter, obtaining the field-name list, modifying the field structure, defining keys and indexes on fields, and querying and computing by field.
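By contrast, the pure set operations that List is good at really are short in Kotlin. A quick sketch with hypothetical data:

```kotlin
fun main() {
    val a = listOf(1, 2, 3, 4)
    val b = listOf(3, 4, 5)
    println(a intersect b)    // [3, 4]        -- intersection
    println(a union b)        // [1, 2, 3, 4, 5] -- union (deduplicated)
    println(a subtract b)     // [1, 2]        -- difference
    println(a.chunked(2))     // [[1, 2], [3, 4]] -- splitting
}
```

None of these requires defining a structure in advance; the difficulty only appears once fields enter the picture.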
Likewise, Scala has List, which is not much different from Kotlin, but Scala designs more specialized data object DataFrame (and RDD, DataSet) for structured data processing.
A DataFrame is a structured data stream, somewhat similar to a database result set. Both are unordered sets, so fetching data by index is not supported and can only be simulated. For example, to fetch the 10th record:
Orders.limit(10).tail(1)(0)
It is easy to imagine that any order-related calculation, such as intervals, moving averages or reverse ranking, is relatively troublesome for DataFrame. Besides being unordered, DataFrame is immutable; to change data or structure, a new DataFrame must be generated. For example, renaming a field is actually achieved by copying records:
Orders.selectExpr("Client as Cli")
DataFrame supports common set calculations such as splitting, merging and cross-merging, and union can be implemented through deduplication. But because these are implemented by copying records, set calculation performance is generally not high.
Despite these shortcomings, DataFrame is a professional structured data object, and its field-access ability is beyond Kotlin's reach. For example, to get the list of field names from the metadata:
Orders.schema.fields.map(it=>it.name).toList
Fetching data by field is also convenient, for example, taking two fields to form a new DataFrame:
Orders.select("Client","Amount") //just using the field name works
Or, use the computed column to form a new DataFrame:
Orders.select(Orders("Client"),Orders("Amount")+1000) //just using the field name doesn't work
Unfortunately, DataFrame only supports referencing fields by name strings; it supports neither field sequence numbers nor default names, which is inconvenient in many scenarios. Moreover, DataFrame does not support defining indexes, so high-performance random queries cannot be implemented; its professionalism still falls short.
The structured data object of SPL is the table sequence, which has the advantages of being professional enough, easy to use, and strong expression ability. To access the members by sequence number:
Orders(3) //take records by index, starting from 1
Orders.to(3) //first 3 records
Orders.m(1,3,5,7:10) //records with sequence number 1, 3, 5, 7-10
For taking records by sequence number from the tail, SPL has the unique feature of using a negative sign to represent an index from the end, which is more professional and convenient than Kotlin:
Orders.m(-1,-3,-5) //the 1st, 3rd and 5th records from last
Orders.m(1,-1) //the first and last records
As a set data type, the table sequence also supports adding, deleting and modifying members, intersection, difference, union and splitting. The table sequence is a mutable set like List; its set calculations use discrete records wherever possible instead of copying them, so its performance is much better than Scala's and its memory footprint smaller. The table sequence is a professional structured data object: beyond set-related functions, more importantly, fields can be accessed conveniently. For example, to get the list of field names:
Orders.fname()
Take two fields to form a new table sequence:
Orders.new(Client,Amount)
Use the computed column to form a new table sequence:
Orders.new(Client,Amount*0.2)
Modify the field name:
Orders.alter(;OrderDate) //without copying records
Some scenarios need to use field sequence number or default name to access field, SPL provides corresponding access methods:
Orders.(Client) //take by field name (expression)
Orders.([#2,#3]) //take by default field name
Orders.field("Client") //take by string (external parameter)
Orders.field(2) //take by field sequence number
As a professional structured data object, the table sequence also supports defining the key and index on field:
Orders.keys@i(OrderID) //define the key and build the hash index at the same time
Orders.find(47) //high speed search through index
Calculation function
Kotlin supports some basic calculation functions, including filtering, sorting, deduplication, cross-merging of sets, various aggregations, and grouping with aggregation. However, these functions all target ordinary sets; once the target becomes structured data objects, the function library looks very insufficient, and calculations usually need hard coding. Kotlin also lacks many basic set operations, which must be self-coded, including association, window functions, ranking, row-to-column conversion, merging and binary search. Among these, merging and binary search are order-related operations; since Kotlin's List is ordered, self-coding them is not too difficult. Overall, Kotlin's function library is weak for structured data calculation.
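A brief Kotlin illustration of both halves of this claim (data hypothetical): the built-in functions handle plain values in one line each, while something as basic as a join had to be hand-rolled, as the association example earlier showed.

```kotlin
fun main() {
    val amounts = listOf(300, 100, 200, 100)
    println(amounts.filter { it > 100 }.sorted())    // [200, 300]   -- filtering + sorting
    println(amounts.distinct())                      // [300, 100, 200] -- deduplication
    println(amounts.sum())                           // 700          -- aggregation
    println(amounts.groupBy { it > 150 }
                   .mapValues { (_, v) -> v.sum() }) // grouping and aggregating
}
```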
Scala has relatively rich calculation functions, all designed for structured data objects, including functions Kotlin lacks such as ranking, association, window functions and row-to-column conversion. Basically, Scala's calculation functions do not go beyond the scope of SQL. Still, some basic set operations are unsupported, especially order-related ones like merging and binary search, which are very difficult even to self-code, because Scala's DataFrame follows SQL's unordered-data concept. Overall, Scala's function library is richer than Kotlin's but still lacks some basic operations.
Among the three languages, SPL has the richest calculation functions, all designed for structured data objects. SPL greatly enriches structured data operations, designing much content beyond SQL, and naturally beyond Scala and Kotlin. For example: ordered computing such as merging, binary search, taking records by interval, and finding the sequence numbers of records that meet a condition. Besides conventional equivalence grouping, SPL supports enumeration grouping, alignment grouping and ordered grouping. SPL divides association into foreign-key association and primary/sub-table association, supports primary keys to constrain data and indexes for fast querying, and supports recursive queries over multi-layer structured data (multi-table associations or JSON/XML).
Let's take grouping as an example. In addition to the conventional equivalence grouping, SPL also provides more grouping schemes:
Enumeration grouping: group by several conditional expressions; records meeting the same condition form one group.
Alignment grouping: group by an external set. The records whose field values are equal to the members of the set are grouped into one group; the order of groups is consistent with the order of the set’s members; the empty group is allowed; the “records that do not belong to this set” can be put into one group separately.
Ordered grouping: group by ordered fields, for example, generating a new group whenever a field value changes or a certain condition is met. SPL provides this ordered grouping directly; just add an option to the conventional grouping function, so it is very simple and performs better. Other languages (including SQL) have no such grouping and must convert it to conventional equivalence grouping or hard-code it.
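To see what "hard code to implement" means, here is a rough Kotlin sketch of ordered grouping written by hand (function name and data are hypothetical): a new group starts whenever the key changes, which SPL expresses with a single option on its grouping function.

```kotlin
// Group consecutive runs of equal keys: open a new group whenever the key changes.
fun <T, K> orderedGroup(data: List<T>, key: (T) -> K): List<List<T>> {
    val groups = mutableListOf<MutableList<T>>()
    for (item in data) {
        if (groups.isEmpty() || key(groups.last().last()) != key(item))
            groups.add(mutableListOf())   // key changed: start a new group
        groups.last().add(item)
    }
    return groups
}

fun main() {
    // Note "a" forms two separate groups -- unlike equivalence grouping
    println(orderedGroup(listOf("a", "a", "b", "a")) { it })  // [[a, a], [b], [a]]
}
```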
Let's look at a few common examples to feel the differences among the three languages' calculation functions.
Sorting
To sort by Client in ascending order and Amount in descending order, Kotlin code:
Orders.sortedBy{it.Amount}.sortedByDescending{it.Client}
Although the code is not long, there are still inconveniences: ascending and descending order use two different functions; the field name must carry the table variable name; and the fields are written in reverse of the actual sorting order.
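For fairness, Kotlin can also express the multi-key sort in one pass with a comparator chain, which fixes the reversed field order, though the fields still need the it. prefix. A sketch with hypothetical data:

```kotlin
data class Order(val Client: String, val Amount: Double)

fun main() {
    val orders = listOf(Order("B", 10.0), Order("A", 20.0), Order("A", 5.0))
    // One comparator chain: Client ascending, then Amount descending
    val sorted = orders.sortedWith(
        compareBy<Order> { it.Client }.thenByDescending { it.Amount })
    println(sorted)   // A:20.0, A:5.0, B:10.0
}
```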
Scala:
Orders.orderBy(Orders("Client"),-Orders("Amount"))
This code is much simpler: the negative sign represents descending order, and the fields are written in the same order as the sort. Unfortunately, the field still needs the table name attached, and a compiled language can only achieve dynamic expression parsing through strings, resulting in inconsistent code style.
SPL:
Orders.sort(Client,-Amount)
Compared with Scala, SPL code is simpler still: fields need no table name, and an interpreted language easily keeps the code style consistent.
Grouping and aggregating
Kotlin:
data class Grp(var Dept:String,var Gender:String)
data class Agg(var sumAmount: Double,var rowCount:Int)
var result1=data.groupingBy{Grp(it!!.Dept,it.Gender)}
.fold(Agg(0.0,0),{acc, elem -> Agg(acc.sumAmount + elem!!.Amount,acc.rowCount+1)})
.toSortedMap(compareBy<Grp> { it.Dept }.thenBy { it.Gender })
We can see that this code is relatively cumbersome: it requires not only the groupingBy and fold functions but also hard coding of the calculation. Every new data structure, such as the two-field grouping structure and the two-field aggregation structure, must be defined in advance, which reduces flexibility and interrupts the fluency of problem solving. The final sort only keeps the result order consistent with the other languages and is not a must.
Scala:
val result=data.groupBy(data("Dept"),data("Gender")).agg(sum("Amount"),count("*"))
The Scala code is much simpler and easier to understand, and it does not need data structures defined in advance.
SPL:
data.groups(Dept,Gender;sum(Amount),count(1))
SPL code is the simplest one, and its expression ability is no less than that of SQL.
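As a further point of comparison, the same two-key grouping in plain Java streams shows the same pain point as Kotlin: the composite key has to be assembled by hand (here a List serves as the key). A minimal sketch with a hypothetical Emp record; only the sum is computed, since adding the count would need a second collector (e.g. Collectors.teeing):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupDemo {
    // Hypothetical record standing in for the article's data rows
    record Emp(String dept, String gender, double amount) {}

    // Group by (Dept, Gender) and sum Amount; the composite key must be
    // built by hand, much like the Kotlin version's Grp class
    static Map<List<String>, Double> sumByDeptGender(List<Emp> data) {
        return data.stream()
                .collect(Collectors.groupingBy(
                        e -> List.of(e.dept(), e.gender()),
                        Collectors.summingDouble(Emp::amount)));
    }

    public static void main(String[] args) {
        List<Emp> data = List.of(
                new Emp("Sales", "F", 100), new Emp("Sales", "F", 50),
                new Emp("HR", "M", 70));
        System.out.println(sumByDeptGender(data));
    }
}
```

Using a List as the grouping key avoids defining a class, but the intent is still far less direct than `groups(Dept,Gender;sum(Amount))`.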
Association calculation
There are two tables with fields of the same name. Now we want to associate them and then perform grouping and aggregation. Kotlin code:
data class OrderNew(var OrderID:Int ,var Client:String, var SellerId:Employee ,var Amount:Double ,var OrderDate:Date )
val result = Orders.map { o->var emp=Employees.firstOrNull{it.EId==o.SellerId}
emp?.let{ OrderNew(o.OrderID,o.Client,emp,o.Amount,o.OrderDate)}
}
.filter {o->o!=null}
data class Grp(var Dept:String,var Gender:String)
data class Agg(var sumAmount: Double,var rowCount:Int)
var result1=result.groupingBy{Grp(it!!.SellerId.Dept,it.SellerId.Gender)}
.fold(Agg(0.0,0),{acc, elem -> Agg(acc.sumAmount + elem!!.Amount,acc.rowCount+1)})
.toSortedMap(compareBy<Grp> { it.Dept }.thenBy { it.Gender })
From this code, we can see that the Kotlin version is very cumbersome: new data structures have to be defined in many places, including the association result, the two-field grouping key, and the two-field aggregate result.
Scala:
val join=Orders.as("o").join(Employees.as("e"),Orders("SellerId")===Employees("EId"),"Inner")
val result= join.groupBy(join("e.Dept"),join("e.Gender")).agg(sum("o.Amount"),count("*"))
Scala is much simpler than Kotlin, because there is no need to define data structures or to hard-code the calculation.
Coding in SPL is simpler compared with Scala:
join(Orders:o,SellerId;Employees:e,EId).groups(e.Dept,e.Gender;sum(o.Amount),count(1))
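For contrast, hand-coding the same inner join plus grouping in plain Java typically means building a hash index on one table and probing it from the other. A sketch with hypothetical Employee and Order records (SellerId matching EId, as in the SPL code above):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class JoinDemo {
    // Hypothetical stand-ins for the article's two tables
    record Employee(int eId, String dept) {}
    record Order(int sellerId, double amount) {}

    // Inner join Orders.SellerId = Employees.EId, then group by Dept
    // and sum Amount -- all hand-coded
    static Map<String, Double> sumAmountByDept(List<Order> orders, List<Employee> employees) {
        // Build a hash index on the employee key
        Map<Integer, Employee> byId = employees.stream()
                .collect(Collectors.toMap(Employee::eId, Function.identity()));
        // Probe the index per order; drop unmatched rows (inner join)
        return orders.stream()
                .filter(o -> byId.containsKey(o.sellerId()))
                .collect(Collectors.groupingBy(
                        o -> byId.get(o.sellerId()).dept(),
                        Collectors.summingDouble(Order::amount)));
    }

    public static void main(String[] args) {
        List<Employee> employees = List.of(new Employee(1, "Sales"), new Employee(2, "HR"));
        List<Order> orders = List.of(new Order(1, 100), new Order(1, 50), new Order(2, 70));
        System.out.println(sumAmountByDept(orders, employees));
    }
}
```

The join condition, the index, and the aggregation are all explicit here, which is exactly the boilerplate that the one-line SPL expression absorbs.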
Comprehensive comparison of data processing
The content of a certain CSV file is not standardized: every three lines correspond to one record, and the second line of each record contains three tab-separated fields (i.e., a set of sets). The task is to rearrange the file into standardized structured data objects and sort by the 3rd and 4th fields.
Kotlin:
data class Order(var OrderID: Int,var Client: String,var SellerId: Int, var Amount: Double, var OrderDate: Date)
var Orders=ArrayList<Order>()
var sdf = SimpleDateFormat("yyyy-MM-dd")
var raw=File("d:\\threelines.txt").readLines()
raw.forEachIndexed{index,it->
if(index % 3==0) {
var f234=raw[index+1].split("\t")
var r=Order(raw[index].toInt(),f234[0],f234[1].toInt(),f234[2].toDouble(),
sdf.parse(raw[index+2]))
Orders.add(r)
}
}
var result=Orders.sortedByDescending{it.Amount}.sortedBy{it.SellerId}
Kotlin is not very professional in data processing; most of the work has to be hard-coded, including taking fields by position and taking fields from the set of sets.
Scala:
val raw=spark.read.text("D:/threelines.txt")
val rawrn=raw.withColumn("rn", monotonically_increasing_id())
var f1=rawrn.filter("rn % 3==0").withColumnRenamed("value","OrderId")
var f5=rawrn.filter("rn % 3==2").withColumnRenamed("value","OrderDate")
var f234=rawrn.filter("rn % 3==1")
.withColumn("splited",split(col("value"),"\t"))
.select(col("splited").getItem(0).as("Client")
,col("splited").getItem(1).as("SellerId")
,col("splited").getItem(2).as("Amount"))
f1=f1.withColumn("rn1",monotonically_increasing_id())
f5=f5.withColumn("rn1",monotonically_increasing_id())
f234=f234.withColumn("rn1",monotonically_increasing_id())
var f=f1.join(f234,f1("rn1")===f234("rn1"))
.join(f5,f1("rn1")===f5("rn1"))
.select("OrderId","Client","SellerId","Amount","OrderDate")
val result=f.orderBy(col("SellerId"),-col("Amount"))
Scala is more professional in data processing and uses many structured data calculation functions instead of hand-written loops. But Scala lacks ordered computing ability; such operations usually require adding a sequence-number column first, which makes the code long.
SPL:
A1	=file("D:\\data.csv").import@si()
A2	=A1.group((#-1)\3)
A3	=A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )
A4	=A3.sort(SellerId,-Amount)
SPL is the most professional of the three in data processing and achieves the calculation goal with structured calculation functions alone. SPL supports ordered computing: it can group directly by position, take fields by position, and take fields from the set of sets. Although the implementation idea is similar to Scala's, the code is much shorter.
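To show the baseline the article measures against, here is roughly what the same three-lines-per-record parsing looks like when hand-coded in plain Java. The record type and sample lines are hypothetical, and the date is kept as a String for brevity:

```java
import java.util.ArrayList;
import java.util.List;

public class ThreeLines {
    // Hypothetical record mirroring the article's Order structure;
    // orderDate kept as String to stay self-contained
    record Order(int orderId, String client, int sellerId, double amount, String orderDate) {}

    // Every 3 raw lines form one record; the middle line holds 3
    // tab-separated fields -- the hard coding described in the article
    static List<Order> parse(List<String> raw) {
        List<Order> orders = new ArrayList<>();
        for (int i = 0; i + 2 < raw.size(); i += 3) {
            String[] f = raw.get(i + 1).split("\t");
            orders.add(new Order(
                    Integer.parseInt(raw.get(i)),      // line 1: OrderID
                    f[0],                              // line 2, field 1: Client
                    Integer.parseInt(f[1]),            // line 2, field 2: SellerId
                    Double.parseDouble(f[2]),          // line 2, field 3: Amount
                    raw.get(i + 2)));                  // line 3: OrderDate
        }
        return orders;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines in the three-line layout
        List<String> raw = List.of("26", "TAS\t1\t2142.4", "2009-08-05");
        System.out.println(parse(raw));
    }
}
```

Positional grouping, positional field access, and type conversion all have to be spelled out by hand, which is the work SPL's `group((#-1)\3)` and positional `~(n)` access do declaratively.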
Application framework
Java application integration
Kotlin code is compiled to bytecode, which can be easily called from Java just like an ordinary class file. For example, the static method fun multiLines(): List<Order> in KotlinFile.kt is correctly recognized in Java and can be called directly:
java.util.List result=KotlinFileKt.multiLines();
result.forEach(e->{System.out.println(e);});
Scala code is also compiled to bytecode and can likewise be easily called from Java. For example, the static method def multiLines():DataFrame of the ScalaObject object is recognized as a Dataset type in Java and can be called with minor modification:
org.apache.spark.sql.Dataset df=ScalaObject.multiLines();
df.show();
SPL provides the general JDBC interface, and simple SPL code can be directly embedded in Java like SQL:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/Orders.xls\").select(Amount>1000 && Amount<=3000 && like(Client,\"*s*\"))";
ResultSet result = statement.executeQuery(str);
Complex SPL code can be stored in a script file first and then called from Java like a stored procedure, which effectively reduces the coupling between the computing code and the front-end application.
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call scriptFileName(?, ?)}");
statement.setObject(1, "2020-01-01");
statement.setObject(2, "2020-01-31");
statement.execute();
SPL is an interpreted language; modified code can be executed directly without compiling. Since SPL supports code hot swapping, this reduces maintenance workload and improves system stability. Kotlin and Scala are both compiled languages, so the application must be restarted at an appropriate time after recompiling.
Interactive command line
The interactive command line of Kotlin must be downloaded separately and started with the kotlinc command. In theory it can perform arbitrarily complex data processing, but because the code is generally long and hard to edit on a command line, it is better suited to simple numeric calculations:
>>>Math.sqrt(5.0)
2.23606797749979
Scala's interactive command line is built in and started with the command of the same name. In theory it can perform data processing, but because the code is relatively long, it is better suited to simple numeric calculations:
scala>100*3
res1: Int = 300
SPL has a built-in interactive command line, started with the "esprocx -r -c" command. SPL code is generally short, allowing simple data processing directly in the command line:
(1): T("d:/Orders.txt").groups(SellerId;sum(Amount):amt).select(amt>2000)
(2):^C
D:\raqsoft64\esProc\bin>Log level:INFO
1 4263.900000000001
3 7624.599999999999
4 14128.599999999999
5 26942.4
Comparing the three languages across all these aspects shows that, for common data processing tasks in application development, Kotlin's development efficiency is low because it is not professional enough; Scala has a certain degree of professionalism, so its development efficiency is higher than Kotlin's but still lower than SPL's; and SPL's development efficiency is much higher than both, thanks to its more concise syntax, higher expressive efficiency, more data source types, easier-to-use interfaces, more professional structured data objects, richer functions, and stronger computing power.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version