How to merge Excel Files easily in Java
Most Java programmers are accustomed to using POI or HSSFWorkbook to auto-merge Excel files. That requires a lot of code and effort as well as strict Excel format, yet with low scalability. Imagine that there is an independent data computing tool. It enables you to write a SQL-style script simply and call it directly from a Java application and return the result set. That’s not imagination, however, because Raqsoft’s esProc has made it achievable. esProc provides JDBC interface to let a third application call an SPL (Structured Process Language) script conveniently. Here let’s look at how this works.
Basic merges
A. Merge multiple sheets in one Excel
The following Excel file stores sales data. It includes worksheets of three months:
January_2013:
February_2013:
March_2013:
We perform a filter merging by selecting Customer Name field and Sale Amount field from every sheet. Here’s the result:
SPL script:
A |
|
1 |
=file(”D:/sales_2013.xlsx”).xlsopen() |
2 |
=A1.conj(A1.xlsimport@t('Customer Name','Sale Amount';~.stname)) |
3 |
>file(“D:/result_2013.xlsx”).xlsexport@t(A2;"merge_sheets") |
Save the script as oneExcel.dfx to be embedded in the Java application.
Script explanation:
A1: Open the Excel file to generate a sequence of worksheets according to it.
A2: conj() function traverses A1’s members to import 'Customer Name' and 'Sale Amount' from each worksheet iteratively and concatenate them together. xlsimport() function imports specified columns where the last column is followed by semicolon; @t option tells the function to use the first row as field names. Parameter ~.stname points to the current worksheet.
A3: Save A2’s table sequence in a worksheet in the original Excel file and name it "merge_sheets" while using @t option to write the first record as column headers.
The three-liner is short, concise, clear and easy to understand.
B. Merge multiple Excels
Now we need to merge multiple Excels. Each Excel file includes multiple worksheets (their structures are the same as those in the previous example) where annual data is stored:
SPL script:
A |
B |
|
1 |
for directory@p(”d:/excel/*.xlsx“) |
=file(A1).xlsopen() |
2 |
=B1.conj(B1.xlsimport@t('Customer Name','SaleAmount','Purchase Date';~.stname)) |
|
3 |
=@|B2 |
|
4 |
>file(“d:/result.xlsx”).xlsexport@t(B3;"merge_data") |
Merging result:
Save the script as MergeExcels.dfx to be embedded in the Java application.
Script explanation:
A1: for statement traverses Excel files in the specified directory to process data in the inner loop in B1-B3.
B1: Open an Excel file to generate a sequence of worksheets.
B2: Import 'Customer Name', 'Sale Amount' and 'Purchase Date' columns from each worksheet of the current file and concatenate them together.
B3: Concatenate B2’s table sequence with the current cell value, which is represented by @.
A4: Save B3’s table sequence in a worksheet in result.xlsx and name it "merge_data".
The program uses a two-level loop to merge multiple Excel files. The outer loop traverses all Excel files under the specified directory and in the inner loop B1.conj merges data from multiple worksheets in each file.
C. Large merging result
In above examples, a merging result is loaded into the memory and then output at a time. If the result is too large or it exceeds the memory capacity, we can process and output data in a stream style.
SPL script:
A |
B |
|
1 |
=file("D:/out.xlsx") |
|
2 |
for directory@p(”d:/excel/*.xlsx“) |
=file(A2).xlsopen() |
3 |
=if(A1.exists(),B2.xlsimport@t(),B2.xlsimport()) |
|
4 |
>A1.xlsexport@s(B3;"merger") |
Merging result:
Save the script as BigExcel.dfx to be embedded in the Java application.
Script explanation:
A1: Open a specified output file.
A2: Traverse all to-be-merged Excel files in the directory.
B2: Open one of A2’s files.
B3: If the output file doesn’t exist, read in all rows of a worksheet in B2’s Excel file, including the column headers; if the output file exists, use @t option to read row 1 as the column headers and read detailed data from row 2.
B4: Append B3’s result to a worksheet named “merger” in A1’s output file in a stream style.
In this example, we traverse all Excel files to read in data from one and append it to the specified output file each time. The stream style output applies to merging a lot of small Excel files into a big one.
JAVA invocation
An SPL script is integration-friendly. The Java application calls an SPL script in same manner as calling a stored procedure via esProc JDBC. We’ll take OneExcel.dfx, the result of merging multiple worksheets in an Excel file, to illustrate the invocation process.
Connection con = null;
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");// Call stored procedure, where OneExcel is the dfx file name
st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call OneExcel()");//Execute stored procedure
st.execute();//Get result set
ResultSet rs = st.getResultSet();
...
Just change the file name to call another dfx file. See How to Call an SPL Script in Java to learn details. esProc also supports the ODBC style integration through same process.
Advanced merges
At times we need to group a merging result before outputting it.
Group & summarize a merging result
A. Group by field(s)
To group a merging result by one or more fields and perform aggregation, you can use the SPL script below:
A |
|
1 |
=file(”D:/sales_2013.xlsx”).xlsopen() |
2 |
=A1.conj(A1.xlsimport@t(;~.stname)) |
3 |
=A2.groups('Customer ID';sum('Sale Amount'):Total,avg('Sale Amount'):Average) |
4 |
=A2.groups('Customer ID','Purchase Date';sum('Sale Amount'):Total) |
A3’s result:
A4’s result:
Script explanation:
A1: Open the specified Excel file.
A2: Read in data from every worksheet and concatenate it.
A3: Group merging result by Customer ID and calculate total sales and average for each group.
A4: Group merging result by Customer ID & and Purchase Date and calculate total sales for each group.
B. Order-based grouping
esProc supports merge grouping, which groups records by comparing one with its next one. This increases efficiency and maintains the original order when records are already ordered. Suppose records are ordered by dates, we can use the following SPL script to perform grouping and aggregation, for example, by month:
SPL script:
A |
B |
|
1 |
for directory@p(”d:/excel/*.xlsx“) |
=file(A1).xlsopen() |
2 |
=B1.conj(B1.xlsimport@t(;~.stname)) |
|
3 |
=@|B2 |
|
4 |
=B3.derive(year('Purchase Date'):Year,month('Purchase Date'):Month) |
|
5 |
=A4.groups (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average) |
|
6 |
=A4.groups@o(month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average) |
A5’s result:
A6’s result:
Script explanation:
A1~B3: As explained previously, the statements merge all worksheets of the same-structured Excel file under the specified directory.
A4: Create a table sequence based on B3’s table sequence and split the Purchase Date values to add Year field and Month field.
A5: groups() function groups merging result by Month and calculate total sales and average for each group.
A6: groups@o() function groups merging result by Year and Month and calculate total sales and average for each group. @o option enables a merge grouping by comparing a row with its next one.
C. Group by segment
To divide records into multiple segments according to specified intervals and perform aggregation:
SPL script:
A |
B |
|
1 |
for directory@p(”d:/excel/*.xlsx“) |
=file(A1).xlsopen() |
2 |
=B1.conj(B1.xlsimport@t(;~.stname)) |
|
3 |
=@|B2 |
|
4 |
=B3.groups(if ('Sale Amount'<1000,"1::<1000",if ('Sale Amount'<1500,"2::1000~~1500",if ('Sale Amount'<2000,"3::1500~~2000",if ('Sale Amount'<2500,"4::2000~~2500","5::>=2500")))):Segment;count(1):Number,sum('Sale Amount'):Total) |
Grouping result:
Script explanation:
A4: Divide records into 5 segments by intervals of sales amount and calculate record count and total sales for each group.
To change the grouping conditions we need to modify parameters in groups() function. And the parameter expression is too complicated. To make the script more concise, we rewrite it as follows:
A |
B |
|
1 |
[0,1000,1500,2000,2500] |
|
2 |
for directory@p(”d:/excel/*.xlsx“) |
=file(A1).xlsopen() |
3 |
=B1.conj(B1.xlsimport@t(;~.stname)) |
|
4 |
=@|B2 |
|
5 |
=B4.groups(A1.pseg(~.'Sale Amount'):Segment;count(1):Number,sum('Sale Amount'):Total) |
We can also perform grouping and aggregation by multiple fields or multiple conditions. For example, sum students’ scores according to ranges - Excellent, Very good, Good, Fair, Pass.
D. Large Excel merging
If the size of to-be-merged Excel files exceeds the physical memory capacity (reading them in with VBA almost can’t solve the problem), we can merge them at retrieval time in a stream style, i.e. retrieving a segment of data to process each time.
SPL script:
A |
B |
|
1 |
=file(“d:/tdata.xlsx”).xlsopen@r() |
|
2 |
for A1.count() |
=A1.xlsimport@ct(;A1(A2).stname) |
3 |
=@|B2 |
|
4 |
= B3.conjx() |
=A4.groups('Customer ID';sum('Sale Amount'):SaleTotal) |
5 |
>file(“d:/out.xlsx”).exportxls@bt(B4;"Customer&Sales") |
Result of filter grouping & aggregation:
Script explanation:
A1: @r opens the Excel in a stream style.
A2: Traverse worksheets in the Excel.
B2: @c option loads data with cursor.
B3: Concatenate cursor B2 with B3’s sequence of cursors.
A4: Concatenate records in B3’s cursors into a new cursor.
B4: Group records in A4’s cursor by Customer ID and sum Sale Amount for each group.
A5: Export result to a target Excel file.
Remove duplicates
In real-world scenarios, we often need to remove duplicate data to get correct analytic result.
A. By primary key
Remove duplicate data from sales_2013 by primary key Invoice Number field.
A |
|
1 |
=file(“d:/sales_2013.xlsx”).xlsopen() |
2 |
=A1.conj(A1.xlsimport@t('Customer Name', 'Invoice Number', 'Sale Amount';~. stname)) |
3 |
=A2.group@1('Invoice Number') |
4 |
>file(“d:/out.xlsx”).xlsexport@t(A3;"result") |
Result of merging Excel without duplicates:
Script explanation:
A1: Open the specified Excel.
A2: Import specified columns from each worksheet to concatenate them.
A3: Group A2’s table sequence by primary key Invoice Number and remove duplicates from each group. Parameter @1 (number 1 instead of letter 1) gets the first record from each group and concatenates them into a record sequence to return.
A4: Export result to a target Excel file.
Though data in every worksheet is unique, data in merging result may not be unique. That is why we remove duplicates by the primary key.
B. By a specified field
Remove duplicates from sales_2013 by a specified field to get records of different customers.
A |
|
1 |
=file(“d:/sales_2013.xlsx”).xlsopen() |
2 |
=A1.conj(A1.xlsimport@t('Customer ID', 'Customer Name';~. stname)) |
3 |
=A2.id('Customer Name') |
4 |
=A2.group@1('Customer Name') |
5 |
>file(“d:/out.xlsx”).xlsexport@t(A4;"result") |
Script explanation:
A1: Open the specified Excel.
A2: Import specified columns from each worksheet to concatenate them.
A3: Get different customer names from A2’s table sequence.
A4: Group A2’s table sequence by Customer Name and get the first record from each group to return.
A5: Export A4’s table sequence to a target Excel while writing row 1 as he column headers.
A3’s distinct result:
A4’s distinct result:
C. By multiple fields
Sometimes we need more than the primary key field to judge whether a record is a duplicate.
A |
|
1 |
=file(“d:/sales_2013.xlsx”).importxls@t() |
2 |
=file(“d:/sales_2014.xlsx”).importxls@t() |
3 |
=[A1,A2].merge('Customer ID', 'Purchase Date') |
4 |
=A3.group@1('Customer ID','Purchase Date') |
5 |
>file(“d:/out.xlsx”).xlsexport@t(A4;"result") |
Script explanation:
A1: Import data from the specified Excel.
A2: Import data from the specified Excel
A3: Merge A1’s table sequence and A2’s table sequence by Customer ID and Purchase Date and return a new table sequence.
A: Group A3 by Customer ID and Purchase Date and get the first record, i.e. the unique one, from each group.
A5: Export result to a target Excel.
D. Remove duplicate records from result
In each worksheet the records are unique but there may be duplicate records in the merging result.
A |
B |
|
1 |
=file(“d:/sales_2013.xlsx”).importxls@t() |
=A1.group@1('Invoice Number') |
2 |
=file(“d:/sales_2014.xlsx”).importxls@t() |
=A2.group@1('Invoice Number') |
3 |
=[B1,B2].merge@u() |
=A3.count() |
Script explanation:
A1: Import data from the specified Excel.
B1: Remove duplicate records from A1’s table sequence by Invoice Number.
A2,B2: Same as A1,B1.
A3: Merge B1’s table sequence and B2’s table sequence while removing duplicate records and return a new table sequence. @u option combines records of two table sequences in order and remove duplicates.
B3: Count records in A3’s table sequence.
merge@u() function merges ordered table sequences containing unique records.
Summary
To merge, group & aggregate and perform distinct over Excel files of same structure or different structure in SPL, esProc just reads each target field as set objects for further processing. It’s the same way of merging text files with esProc, the professional data computing tool.
SPL strengths
- A universal tool
Data computing in Java involves a mountain of work that generates unreusable, lengthy code. And most of the time the source data even doesn’t come from a database. SPL equalizes all types of source to process data in a uniform and convenient way.
- Free & open-source
The free esProc Community Edition has been released. To use it permanently, users can deploy it on the server and integrate it with Java.
- Solid technical support
You can find solutions to general problems in our Documentation. For users with esProc Basic Edition, you can ask any questions in Raqsoft Community, where we offer free technical supports.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL