How to Handle Huge XLS Files
Excel displays poor performance in handling huge xls files. Usually we load an Excel file to the database and handle it using the database’s computing ability. Sometimes, however, the data can’t be wholly loaded to the database due to specific reasons. It would be great there was an application that can directly deal with massive xls files.
Take the employee information file emp.xls (Below is part of the source data):
EID |
NAME |
SURNAME |
GENDER |
STATE |
BIRTHDAY |
HIREDATE |
DEPT |
SALARY |
1 |
Rebecca |
Moore |
F |
California |
1974-11-20 |
2005-03-11 |
R&D |
7000 |
2 |
Ashley |
Wilson |
F |
New York |
1980-07-19 |
2008-03-16 |
Finance |
11000 |
3 |
Rachel |
Johnson |
F |
New Mexico |
1970-12-17 |
2010-12-01 |
Sales |
9000 |
4 |
Emily |
Smith |
F |
Texas |
1985-03-07 |
2006-08-15 |
HR |
7000 |
And the state information file states.xls as an example (Below is part of the data):
STATEID |
NAME |
POPULATION |
ABBR |
AREA |
CAPITAL |
REGIONID |
1 |
Alabama |
4779736 |
AL |
52419 |
Montgomery |
6 |
2 |
Alaska |
710231 |
AK |
663267 |
Juneau |
9 |
3 |
Arizona |
6392017 |
AZ |
113998 |
Phoenix |
8 |
4 |
Arkansas |
2915918 |
AR |
52897 |
Little Rock |
7 |
Task: Join the two tables through emp’s STATE column and states’ NAME column and get records where SALARY is above 5000 and POPULATION is below 5 million.
It’s easy to do this with esProc.
You can download esProc installation package and free DSK edition license HERE.
1. Get records of states where POPULATION is below one million:
Write script wherexls.dfx in esProc:
A |
|
1 |
$select * from states.xls where POPULATION<1000000 |
A1 gets records of states where POPULATION is below one million using simple SQL.
Below is A1’s result after execution:
2. Group emp records by genders and count employees in each group:
Write script groupxls.dfx in esProc:
A |
|
1 |
$select GENDER,count(*) as count from emp.xls group by GENDER |
A1 groups emp by gender and count employees in each group using simple SQL.
Below is A1’s result after execution:
3. Join emp table and states table through emp.STATE and states.NAME and select records where state population is below 5 million and employee salary is above 5000:
Write script joinxls.dfx in esProc:
A |
|
1 |
$select * from emp.xls b join states.xls a on a.NAME=b.STATE where a.POPULATION<5000000 and b.SALARY >5000 |
A1 performs join filtering over two tables using simple SQL.
Below is A1’s result after execution:
We can convert an xls file into a bin file. The bin file is esProc’s built-in binary file format. The format uses simple compression mechanism to store same size of data in smaller space and thus enables less time in reading it.
Take orders file orders.xls as an example. To convert it to a bin file, we use the following script:
A |
|
1 |
=file("orders.xls").xlsimport@t() |
2 |
=file("orders.btx").export@b(A1) |
Below is the disk space used by xls file and bin file:
Like an Excel file, a bin file can be directly handled with esProc. For example:
Group orders file orders.btx by year and list years when the total number of orders is less than 10,000:
Write script groupbtx.dfx in esProc:
A |
|
1 |
$select year(O_ORDERDATE) as year,count(*) as count from orders.btx group by year(O_ORDERDATE) having count(*)<10000 |
A1 groups records by year and count orders for each year using simple SQL.
Below is A1’s result after execution:
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_Desktop/
SPL Learning Material 👉 https://c.scudata.com
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProcDesktop
Linkedin Group 👉 https://www.linkedin.com/groups/14419406/