"Usually, we load text data into the database and then use SQL. Still, if the application only us .."

Jerry RaqForum 14 No.
1,515 View • 3 Years ago

Does Java have an open source package that can execute SQL on txt/csv?

Usually, we load text data into the database and then use SQL. Still, if the application only uses the data once and then discards it, it is unnecessary to load it into the database because it needs to build a table and then import the data. The process is cumbersome, and many temporary tables are wasteful of storage, and management is also troublesome. If it is only a single table query, some databases support text tables. That is, applications can query data in external files through the database. For example, h2 and HSQLDB can execute, SELECT * FROM CSVREAD(‘test.csv’);, but if you want To do an associative query with other tables, you still need to export test.csv into a normal internal table.

Another way is to use the Java calculation package and use SQL directly in Java, for example, CsvJdbc and Open-esProc. CsvJdbc is relatively simple. The data must be in CSV format or a fairly regular txt file. The specific usage is to execute SQL on the specified file through the CsvJdbc driver. Open-esProc is more powerful. First of all, Open-esProc is not a computing package that only supports SQL execution on txt/csv like CsvJdbc, but a comprehensive and professional computing package for processing structured data. The usage is different from the standard jar package. It encapsulates the data types and methods in a scripting language called SPL. It calculates the data by SPL first and then calls the SPL script in the Java program to return the ResultSet object. The usage is the same as CsvJdbc, and it’s just that SPL encapsulates the perfect method of querying file data with SQL.

The normal usage of Open-esProc is to write SQL statements in a special SPL IDE first, save the script separately, and then call it in a Java program. For example: Calculate the average Chinese score, the highest score in mathematics, and the total score in English for all students in the student report table. The SPL script is as follows:

	A
1	$select avg(Chinese),max(Math),sum(English) from E:/txt/Students_scores.txt

Script files (such as condition.dfx) are stored together with Java and called JAVA through the JDBC interface. The usage is similar to stored procedures.

…
 ResultSet result = statement.executeQuery("call condition.dfx");
…

SPL also supports SQL-like usage, without script files, directly embedding it in JAVA.

…
 ResultSet result = statement.executeQuery("
$select avg(Chinese),max(Math),sum(English) from E:/txt/Students_scores.txt");
…

SPL supports most of the syntax in the SQL92 standard, such as using the SQL ‘with …’ a clause to calculate data in text files.

Example: Find the HR, R&D, and Sales from the department data file, and then calculate the number of female employees and the average salary of these departments.

	A
1	$with A as (select NAME as DEPT from E:/txt/DEPARTMENT.txt where NAME='HR' or NAME='R&D' or NAME='Sales') select A.DEPT DEPT,count(*) NUM,avg(B.SALARY) AVG_SALARY from A left join E:/txt/EMPLOYEE.txt B on A.DEPT=B.DEPT where B.GENDER='F' group by A.DEPT

For more SQL examples in SPL, you can refer to SQL Query on File

Compared with CsvJdbc, Open-esProc can also easily support irregularly formatted text files, for example, when there is no column name.

	A
1	$select * from {file("sOrderNT.txt").import()} where _2='TAS'

It can be seen from SQL that when the data has no column name, the default column name is “underscore + serial number” , the column name of the second column is represented by _2, and the other columns and so on. {} is the SPL language, where the import function can read files with more complex formats.

To use the column names more intuitively, you can change the column names before calculating. The traditional SQL writing method changes the column name with a subquery, and the above SQL can be adjusted to

	A
1	$select * from ( select \_1 OrderID ,\_2 Client, \_3 SellerId, \_4 Amount, _5 OrderDate from {file("sOrderNT.txt").import()})

The separator of some files is not a common tab or comma but special symbols such as semicolon and colon. Sometimes multiple characters are used as the separator. In this case, the import function in the SPL statement should be used to specify the separator. SPL can also flexibly support special separators. For example, the separator of sep.txt is a double line ||, and you can use the following SQL to read.

	A
1	$select * from {file("sep.txt").import@t(;,"\|\|")}

There are some other weird situations. For example, some files do not use carriage return and line feed to distinguish records but use different symbols. For example, the entire file is a large string; some files have blank lines and need to be skipped before using them. The situation can be well supported, more examples for reference How to Use SQL in esProc

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

Promote