SPL Practice: Structured Text computing
Structured Text Computing Requirements
Structured text files are a common data storage method, such as this score.txt file, which records the scores of students in all classes. The first row is the column name, and each subsequent row is the data of a student, separated by tab within each row.
This structured text also has various computational and processing requirements. For example, selecting the girls with the English score of 90 or above, to calculate the total score of each student and sort them in descending order, to calculate the average Chinese language score for each class.
Importing into a database and using SQL to calculate is certainly not a problem, but the process is complicated and there may not always be a database available. Writing Java code is also very tedious, but using SPL is very simple.
Read structured text file
First, let's see how to load the file. The import() function of A1 can directly read the structured text file into an in-memory data table. The @t option represents the first row is field name. For comma separated CSV files, the option @c can be used to read in, and @tc represents reading the CSV text with the first row being column names.
Read score.txt and score.csv respectively in A1/A2. A3/A4 can also read score.txt and score.csv using the T() function, which can automatically load based on the file extension name.
There are other situations when reading text data, such as not having a title row, only reading partial columns, custom separators, and text with quotation marks and carriage returns. For details on these functions, please refer to the official documentation.
A |
|
1 |
=file("score.txt").import@t() |
2 |
=file("score.csv").import@tc() |
3 |
=T("score.txt") |
4 |
=T("score.csv") |
Try filtering
Let's first try the filtering operation, for example, to select girls who score 90 or above in English. A1 uses the T()function to load text data into an in-memory data table, and A2 uses the select() function to filter the data, it can be seen that only girls with scores equal to or above 90 remain in the result. The dot operator in A3 can combine loading data and filtering operation into one statement, with the same calculation result.
If this file is regarded as a table, the corresponding SQL statement for the operation is as follows:
SELECT * FROM score
WHERE English>=90 AND Gender=2
Sometimes it is desired to dynamically pass in the values of the filtering criteria. The default values can be predefined as 80 points and male students, which can be modified during runtime. The select function in A4 can use these two parameter variables.
In A5, use the export()function to write the result table sequence of A2 back into a text file named 'filter. txt'. Similar to the import() function, the @t option indicates that the first row will be written as a title.
A |
|
1 |
=T("score.txt") |
2 |
=A1.select(English>=90 && Gender==2) |
3 |
=T("score.txt").select(English>=90 && Gender==2) |
4 |
=T("score.txt").select(English>=argEng && Gender==argGen) |
5 |
=file("filter.txt").export@t(A2) |
Then try aggregation
Then try some aggregation operations. Calculate the highest and average scores for both girls and boys separately, which is consistent with the SQL aggregation functions.
A |
|
1 |
=T("score.txt") |
2 |
=A1.select(English>=90 && Gender==2) |
3 |
=T("score.txt").select(English>=90 && Gender==2) |
4 |
=T("score.txt").select(English>=argEng && Gender==argGen) |
5 |
=file("filter.txt").export@t(A2) |
SQL:
SELECT max(Math), avg(Math)
FROM score WHERE English>=90 AND Gender=2
SELECT max(Math), avg(Math)
FROM score WHERE English>=90 AND Gender=1
Integrate into Java applications
After the initial calculation attempts, let's take a look at how to integrate these codes into Java applications. SPL offers standard JDBC, which can provide data querying and computing capabilities to Java applications like executing SQL or calling stored procedures.
[JDBCConection].prepareStatement(SPL statement)
[JDBCConection].prepareCall(SPL script)
Import the jar files from the installation directory \ esProc \ lib into the application:
esproc-bin-xxxx.jar
icu4j-60.3.jar
Then copy raqsoftConfig. xml from the installation directory \ esProc \ config to the application's class path. raqsoftConfig. xml is the core configuration file of SPL, and its name cannot be changed.
JDBC executes SPL code
The driver class is InternalDriver, the connection URL is jdbc: esproc: local.
The created JDBC connection can execute SPL code like an SQL query. Load score.txt here to filter students with high scoring English. Use JDBC parameters to dynamically set the minimum score, with 90 set here. Run this Java program and, like other regular JDBC uses, return the result table sequence as a ResultSet object.
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con= DriverManager.getConnection("jdbc:esproc:local://");
PreparedStatement ps = con.prepareStatement("=T(\"score.txt\").select(English >= ? )");
ps.setInt(1, 90);
ResultSet rs = ps.executeQuery();
JDBC calls SPL script
Save the SPL program as a .splx script file, here saved as scoreCalc.splx:
A |
|
1 |
=T("score.txt").select(English>=90) |
2 |
return A1 |
Then put the script file into the main directory of esProc, which is configured as the root directory of D drive in raqsoftConfig.xml.
Then you can call these scripts like using JDBC to call stored procedures:
PreparedStatement ps = con.prepareCall("call scoreCalc()");
ps.execute();
ResultSet rs = ps.getResultSet();
Executing call scoreCalc() calls and executes scoreCalc.splx. Return the result set to ResultSet.
JDBC executes SPL and returns single or multiple values
In addition to returning a table sequence of multiple rows and columns to the JDBC ResultSet object, it can also return a single value of one row and one column, or a sequence of multiple values of multiple rows and one column.
A |
|
1 |
=T("score.txt").sum(Math) |
2 |
return A1 |
A single value with 1 row and 1 column, with the column name fixed as "Field". For example, A2 returns the total math score of 70360.
A |
|
1 |
=T("score.txt").sum(Math) |
2 |
=T("score.txt").count(Math) |
3 |
=T("score.txt").avg(Math) |
4 |
=T("score.txt").min(Math) |
5 |
return [A1, A2, A3, A4] |
Multiple values with multi rows and 1 column, with the column name still being "Field", such as A5 returning the total score, number of people, average score, and lowest score in mathematics.
Using variable
Next, let's continue to look at more computing tasks and implement common SQL operations using SPL.
For the convenience of discussion, we assume that all A1 cells of the code are to load the file, and then assign the table sequence to the score variable. Afterwards, there is no need to write A1 anymore, only calculate on score.
In addition, for simplicity, SPL code no longer uses parameters, and you can modify it to have parameters by yourself.
A |
|
1 |
>score=T("score.txt") |
2 |
=score.select(English>=90 && Gender==2) |
3 |
…… |
Calculate columns
Calculate a new field using multiple field values of the same row, for example, calculate the total score and average score of each student's three subjects.
SPL uses the new() function to define calculated columns, and the expression in the calculated column can use the existing calculated columns, such as calculating total first and then using total to calculate avg.
A |
|
2 |
=score .new( Gender, Name ,(English+Chinese+Math):total ,round(total/3, 2):avg ) |
In the new() function, some original columns can also be selected and mixed with new computed columns or adjust the order of the columns. Here, the Gender is adjusted before the Name.
SQL:
SELECT
Gender, Name
,(English+Chinese+Math) as total
,(English+Chinese+Math)/3 as avg
FROM score
Add calculated columns
If you want to add new calculated columns directly to the existing table sequence, you can use the derive() function. Add the total and average scores of the three subjects here.
A |
|
2 |
=score .derive( (English+Chinese+Math):total ,round(total/3, 2):avg ) |
SQL:
SELECT
*
,(English+Chinese+Math) as total
,(English+Chinese+Math)/3 as avg
FROM score
Sort
Sort all students in descending order of total score, and for students with equal total scores, sort them in ascending order of math scores. SPL uses the sort() function to implement sorting, with a negative sign before the sorting field 'total' indicating descending order and 'Math' indicating ascending order.
A |
|
2 |
=score.derive( (English+Chinese+Math):total ).sort(-total, Math) |
In the results, it is observed that there are multiple students with a total score of 281, with students with lower math scores ranking higher.
Fields of string and date types also support sorting. For example, to sort in reverse order by Name will compare the characters one by one from the first character to the next.
SQL:
SELECT
(English+Chinese+Math) as total
,*
FROM score
ORDER BY
(English+Chinese+Math) DESC
,Math DESC
Group and aggregation
We want to group by class and gender, and calculate the number of people and average English scores. SPL uses the groups() function to implement grouping and aggregation. Define the grouping fields Class and Gender before the semicolon in parameter. When there are multiple grouping fields, separate them with commas; Define the aggregation expression after the semicolon, supporting aggregations such as sum, count, avg, min, max, etc.
A |
|
2 |
=score.groups( Class , Gender ; count(Name):cnt , round(avg(English),2):engAvg ) |
SQL:
SELECT
Class,Gender
, COUNT (Name) AS cnt
, ROUND(AVG (English),2) AS engAvg
FROM score
GROUP BY Class,Gender
Filter after grouping and aggregation
Before grouping, use the select() function to filter students who score 60 or above in mathematics;
After grouping and aggregation, use the select() function to filter groups with an average English score exceeding 70. SPL is defined and calculated step by step, and intermediate steps can be filtered at any time without using WHERE/HAVING to distinguish whether it occurred after the grouping.
A |
|
2 |
=score.select(Math>60) .groups( Class,Gender ;count(Name):cnt, round(avg(English),2):engAvg ).select(engAvg > 70) |
SQL:
SELECT
Class, Gender, COUNT (Name) AS cnt
, ROUND(AVG (English),2) AS engAvg
FROM score
WHERE Math>60
GROUP BY Class,Gender
HAVING ROUND(AVG (English),2) > 70
Remove duplicate values
In SPL, using the id() function can remove duplicate values. After removing duplicate names for one thousand students, we see 598 names.
A |
|
2 |
=score.id(Name) |
SQL:
SELECT DISTINCT Name FROM score
Multiple columns can also be combined to remove duplicates, which means that the values of multiple columns are considered duplicate only when they are all the same. For example, if a thousand students combine Name and Math to remove duplicates, there will be 992 left.
A |
|
2 |
=score.id([Name,Math] ) |
SQL:
SELECT DISTINCT Name,Math FROM score
Count distinct
In SPL, the icount () function is used to implement count distinct, resulting in 598 names after deduplication among 1000 students.
A |
|
2 |
=score.icount(Name) |
SQL:
SELECT COUNT(DISTINCT Name) Name FROM score
icount also supports multi column count distinct, with 992 remaining after deduplication for names and math scores that are all the same.
A |
|
2 |
=score.icount([Name,Math] ) |
SQL:
SELECT
COUNT(*)
FROM (
SELECT DISTINCT Name, Math FROM score
)
Count distinct within groups after grouping
The icount() function can also be used within the groups function, allowing for count distinct within the groups. There are 54 names for 56 people in Class 2 and 53 names for 55 people in Class 3.
A |
|
2 |
=score.groups( Class ; icount(Name):nameCnt , count(Name):cnt ) |
SQL:
SELECT
Class
, COUNT(distinct Name) AS nameCnt
, COUNT(Name) AS cnt
FROM score
GROUP BY Class
TOP-N
It is also common to calculate the top few records, and the top() function gets the top N records.The first parameter is a negative number, indicating reverse order. -5 will get the top 5 records with the highest total. If it is a positive number of 5, it will get the bottom 5 records with the lowest total.
When getting the top N, duplicate values will occupy the ranking by default, such as the 4th and 5th both scoring 281.
A |
|
2 |
=score.derive((English+Chinese+Math):total) .top(-5;total) |
SQL:
SELECT
TOP 5
*, (English+Chinese+Math) as total
FROM score
ORDER BY (English+Chinese+Math) DESC
If you want duplicate values not to occupy the ranking, you can use the @i option. Among the top 5, if the 4th has duplicate values and there is a total of 3 people, then a total of 7 people will be found in the top 5.
***.top@i(-5;total)
To find the maximum and minimum records can also use the maxp and minp functions, which are equivalent to top 1 and top -1, respectively.
Max ***.maxp(total) = top(1,total) :
Min ***.minp(total) = top(-1,total)
TOP-N within groups
To find out the top 5 students in each class, you can use the top function in the group() function,
A |
|
2 |
=score.derive( (English+Chinese+Math):total ).group(Class;~.top(-5;total)) |
The top 5 students from each class as shown in the figure are a series of subsets. Clicking on Class 3 will reveal the top 5 students from Class 3.
The groups()and group() functions are both used for grouping. Simply put, the former calculates some aggregation values directly after grouping, while the latter obtains subsets of each group after grouping. The subsets can be flexibly defined for more calculations and may not always be aggregated.
SQL:
SELECT
* , (English+Chinese+Math) AS total ,
RANK OVER(
PARTITION BY Class ORDER BY
English+Chinese+Math DESC
) AS rank
FROM score WHERE rank <= 5
More calculation functions:Date calculation
SPL provides many related calculation functions for different types of data.
Here listed some calculations based on the send and receive date fields, to obtain the year, month, day, and day of the week, to calculate how many days and weeks there is a difference between two dates, and to calculate the new date after a date.
Based on send (Sending Time) :2023-12-03 15:12:23,receive(Receiving Time):2024-03-15 11:12:30:
year(send) = 2023
month(send) = 12
day(send) = 3
day@w(send) = 1 , Sunday
interval(send, receive) = 103,103 days in between
interval@w(send, receive) = 14,14 weeks in between
elapse(send,50) = 2024-01-22 15:12:23,50 days after
……
More calculation functions:String calculation
There are also many calculation functions for string type data. Multiple strings can be concatenated together using concat, to determine if all characters are lowercase letters, to convert to lowercase, to truncate left, right, or middle substrings, to match strings with wildcard characters, to replace strings, and to randomly generate new strings of specific lengths using certain characters.
Based on first(First Name):John,last(Last Name):Smith to calculate :
concat(first,last) = JohnSmith
concat(first, " ", last) = John Smith
islower(last) = false
lower(last) = smith
left(last,3) = Smi
mid(last,2,4) = mith
like(last, "*th") = true
replace(last, "th", "??") = Smi??
rands(“abcdefg12345“,5) = 4d32d
……
More calculation functions:Convert to date type
SPL provides conversion functions between date and time, various numerical values, and string types. The target type is used as the function name, and the function automatically determines the original data type and correctly converts it to the target type. For example, the date() function can pass in integers such as year, month, day, string date, and can also specify the parsing format of the string date, ultimately obtaining a date.
date(1982 , 8 , 9) = 1982-08-09
date("1982-08-09") = 1982-08-09
date("1982-08-09 12:15:30") = 1982-08-09
date("08/09/1982" , "MM/dd/yyyy") = 1982-08-09
date("9, Aug 1982" , "d MMM yyyy" : "en") = 1982-08-09
datetime("1982-08-09 13:15:30") = 1982-08-09 13:15:30
datetime("9 Aug 1982 1:15 PM" ,
"d MMM yyyy h:mm a" : "en") = 1982-08-09 13:15:00
time("13:15:30") = 13:15:30
time("13/15/30” ,"HH/mm/ss") = 13:15:30
time("1:15 PM", "h:mm a" : "en") = 13:15:00
……
More calculation functions:Convert to numerical type
There are many numerical types, including integer int, long integer, double precision floating-point number float, real number, and large floating-point number decimal. In addition to converting between various numerical types, string and date types can also be converted to numerical values.
int("68") = 68
int("68.73") = 68
int(68.73) = 68
int(3.1*3.2) = 9
float("68.73") = 68.73
float(68) = 68.0
number("68f") = 68.0
number("$1,100.05", "$#,###.##") = 1100.05
long("68.73") = 68
long(now()) = 1727677043202
decimal(1232456523427854423905234234)
……
More calculation functions:Convert to string type
When converting date, time, and numerical values into strings, the format can also be specified.
string(123) = "123"
string(now(), "MMM dd, yyyy") = "Sep 30, 2024"
string(1100.05 , "$#,##0.00") = "$1,100.05"
string(datetime("1982-08-09 13:15:30"), "d MMM yyyy h:mm a" : "en") = "9 Aug 1982 1:15 PM“
…….
There are more calculation functions available, which can be viewed in detail in the official online documentation https://doc.scudata.com.
Types of associations
Let's take a look at operations involving multiple file associations.
SPL regards associations differently from SQL, as it distinguishes between foreign key associations and primary key associations.
For example, if the City foreign key in the student table is linked to the primary key CityID in the city table, this is a foreign key association. The ordinary fields Class and School in the student table are respectively linked to the primary keys ClassID and SchoolID in the class table. This two field and multi field association is also a foreign key association.
The student table is associated with the primary key StudentID of the student health table using the primary key, which is called primary key association. These two tables have a one-to-one relationship, also known as a homo-dimensional table.
The student table is associated with the student payment table's partial primary key StudentID using the primary key, which also belongs to primary key association. These two tables have a one to many relationship, also known as the primary sub tables.
Based on different association characteristics, SPL uses different functions to distinguish and implement associations.
Association operations involve multiple files. Similarly, in A1, multiple texts will be loaded as SPL table sequence variables to facilitate subsequent calculations.
A |
|
1 |
>student=T("student.txt"), city=T("city.txt"), class=T("class.txt"), stu_pay=T("stu_pay.txt"), stu_health=T("stu_health.txt") |
Foreign key association calculation
There are class teacher ID, English teacher ID, and other information in class.txt, while teacher.txt records the teacher's name, gender, and subject being taught. We need to find out the classes where the class teacher is a Chinese teacher, and it is necessary to associate the ClassTeacher in class.txt with the teacherId in teacher.txt.
SPL uses the switch()method to join two tables. During the calculation process of A2, four table sequences are involved, with class.txt loaded as the table sequence class and teacher.txt loaded as the table sequence teacher, then use the switch() function to replace ClassTeacher field value of the class with the relevant record of the teacher, and calculate the CT table sequence. The field values of the SPL table sequence can be complex data types such as records or even sub table sequences. CT is a multi-level nested table sequence.
It is not difficult to obtain data from nested sub records. For example, when CT2 with a simple structure is generated by CT new(), ClassTeacher.Subject obtains the subject taught by the class teacher, and finally uses the select() function to find the Chinese subject.
A |
|
2 |
>CT=class.switch( ClassTeacher ,teacher:teacherId ) ,CT2=CT.new( Class , ClassTeacher.Name , ClassTeacher.Subject ).select(Subject=="Chinese") |
SQL:
SELECT
C.Class ,T.Name ,T.Subject
FROM
class C JOIN teacher T
ON C.ClassTeacher=T.treachId
WHERE T.Subject="Chinese"
Multi level foreign key association
The FromCity field of the student table is foreign key to the city table, the ClassID field is foreign key to the class table, and the English, Chinese, and Mathematics teacher fields in the class table are all foreign key to the teacher table.
The student table has two foreign key fields, and the class table has three. There is also a multi-level association between the student table and the class table, and between the class table and the teacher table.
To query students from City4 and whose teacher is Teacher8, four tables need to be associated together. The switch() function supports associating multiple foreign key tables simultaneously, for example, in A2, first associate the class table with the three subject teachers, and then associate the city table and A2's class table in A3, and the result of A3 association is a more complex multi-level table sequence, which can visually display the association results.
Finally, in A4, use the new() function to extract the required information from the table sequence at each level and filter out students from City4 who are taught by Teacher8.
A |
|
2 |
=class.switch( EnglishTeacher, teacher:TeacherID ; ChineseTeacher, teacher:TeacherID ; MathTeacher, teacher:TeacherID ) |
3 |
=student.switch( ClassID, A2 : ClassID ; FromCity, city : CityID ) |
4 |
=B2.new(StudentID, Name ,FromCity.Name : City ,ClassID.EnglishTeacher.Name : EnTeacher ,ClassID.ChineseTeacher.Name : CnTeacher,ClassID.MathTeacher.Name : MaTeacher ).select(City=="City4" && (EnTeacher=="Teacher8" || CnTeacher=="Teacher8" || MaTeacher=="Teacher8")) |
SQL:
SELECT
s.StudentID, s.Name, c.Name AS city, en.Name AS EnTeacher,
cn.Name AS CnTeacher, ma.Name AS MaTeacher
FROM
student s JOIN city c JOIN class cl
JOIN teacher en JOIN teacher cn JOIN ma
ON
s.FromCity=c.CityID AND s.ClassID=cl.ClassID AND cl.EnglishTeacher=en.TeacherID
AND cl.ChineseTeacher=cn.TeacherID AND cl.MathTeacher=ma.TeacherID
WHERE
City="City4" AND
(EnTeacher="Teacher8" OR CnTeacher=“Teacher8” OR MaTeacher=“Teacher8”)
Multi field foreign key association
The switch()function is suitable for single field foreign key associations, and when multiple fields are associated, the join() function should be used.
Add a School column to the score table to record the scores of 30 classes in 5 schools, with 6 classes in each school. The class table records these 30 classes, with the primary keys being the Class and School fields. The classes are classified into Type 1, Type 2, and Type 3.
Now we need to find out the scores of all students in Type3 classes. In the parameters of the join() function, before the first comma are multiple foreign key fields, Class and School; Before the second comma is the associated table class, along with its multiple primary key fields Class and School; After that are the target field selected from the associated table, which can be multiple, and we only selected one ClassType field here.
The switch() function is for single field association that supports multiple tables, resulting in nested multi-level table sequences.
The join() function can support multi field association, but not multi table. The result is to directly join the fields of two tables to form a single-layer structured table sequence.
A |
|
2 |
>class=class.select(ClassType==3) |
3 |
>CT=score.join( Class:School ,class:Class:School ,ClassType ) |
SQL:
SELECT
s.*, c.ClassType
FROM
score s JOIN class c
ON s.Class=c.Class
AND s.School=c.School
WHERE c.ClassType=3
Homo-dimensional table - primary key association
The primary keys of the student table, family table, and health table are all Student IDs, and the three tables have a one-to-one homo-dimensional relationship. The latter two tables provide supplementary information on different aspects of students.
The primary key association requires the use of a new association function, such as join (A, B, C), which defines information for multiple associated tables in the parameters: the associated table (student), the field name(stu) in the table sequence after association, and the associated primary key (StudentID).
Observing the association result of A2, the three table records with Student ID 2 are associated together.
After association in A2, A3 filters out students and parents with myopia degree exceeding 500 degrees.
A |
|
2 |
=join( student : stu , StudentID; family : stu_family , StudentID; health : stu_health , StudentID ) |
3 |
=A2.new(stu.Name, stu_family.Father , stu_family.Mother, stu_health.EyeVision) .select(EyeVision>500) |
SQL:
SELECT
stu.Name, stu_family.Father
, stu_family.Mother, stu_health.EyeVision
FROM
student stu
JOIN family stu_family
JOIN health stu_health
ON stu.StudentID=stu_family.StudentID
AND stu.StudentID=stu_family.StudentID
WHERE sty_health.EyeVision>500
Primary sub table - primary key association
Another type of primary key association is the primary sub table, where the student payment table 'pay' is a sub table. It has two primary keys, Student ID and PayDate, and uses the Student ID as a foreign key to the primary table’s ‘student‘ primary key.
The primary sub table also use the join() function to associate. A student has 5 payment records, and each payment record is associated with the student record in the primary table.
After association in A2, A3 uses the groups() function to calculate the total payment amount for each student.
A |
|
2 |
=join( student : stu,StudentID; family : stu_family,StudentID; pay : stu_pay,StudentID ) |
3 |
=A2.groups( stu.Name, stu_family.Father; sum(stu_pay.PayMoney):allPay ) |
SQL:
SELECT
stu.Name, stu_family.Father,
sum(stu_pay.PayMoney):allPay
FROM
student stu
JOIN family stu_family
JOIN pay stu_pay
ON stu.StudentID=stu_family.StudentID
AND stu.StudentID=stu_pay.StudentID
GROUP BY stu.Name, stu_family.Father
And more…
The above is the content of structured text calculation in SPL, which provides a simple and powerful calculation solution for data outside of databases.
The examples involved here are all small data that can be read into memory. For files larger than memory, SPL also provides a cursor solution for calculation. Interested friends can refer to SPL's official documentation.
Of course, SPL's capabilities go far beyond that. SPL also supports various text formats such as JSON and XML, cross database queries, and supports mixed computing between databases and other non databases, supports for big data or complex computing, and SPL's computing power can also be used to optimize SQL performance. We will gradually introduce these contents in later topics.
SPL is an open source software that can be obtained from GitHub. https://github.com/SPLWare/esProc
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese Version