String Parsing, Grouping & Writing to Multiple Files
【Question】
I have one large file that contains a bunch of weather data. I have to allocate each line from the large file into its corresponding state file. So there will be a total of 50 new state files with their own data.
The large file contains ~1 million lines of records like this:
COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553
Station names vary and can contain a different number of words.
Right now I am using regex to find the pattern and output to a file, and it must be grouped by state. If I read in the entire file without any modifications it takes about 46 seconds. With the code to find the state abbreviation, create the file, and output to that file, it takes over 10 minutes.
This is what I have right now:
package climate;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * This program will read in a large file containing many stations and states,
 * and output in order the stations to their corresponding state file.
 *
 * Note: This takes a long time depending on processor. It also appends data to
 * the files so you must remove all the state files in the current directory
 * before running for accuracy.
 *
 * @author Marcus
 *
 */
public class ClimateCleanStates {
    public static void main(String[] args) throws IOException {
        Scanner in = new Scanner(System.in);
        System.out
                .println("Note: This program can take a long time depending on processor.");
        System.out
                .println("It is also not necessary to run as state files are in this directory.");
        System.out
                .println("But if you would like to see how it works, you may continue.");
        System.out.println("Please remove state files before running.");
        System.out.println("\nIs the States directory empty?");
        String answer = in.nextLine();
        if (answer.equals("N")) {
            System.exit(0);
            in.close();
        }
        System.out.println("Would you like to run the program?");
        String answer2 = in.nextLine();
        if (answer2.equals("N")) {
            System.exit(0);
            in.close();
        }
        String[] statesSpaced = new String[51];
        File statefile, dir, infile;
        // Create files for each states
        dir = new File("States");
        dir.mkdir();
        infile = new File("climatedata.csv");
        FileReader fr = new FileReader(infile);
        BufferedReader br = new BufferedReader(fr);
        String line;
        line = br.readLine();
        System.out.println();
        // Read in climatedata.csv
        final long start = System.currentTimeMillis();
        while ((line = br.readLine()) != null) {
            // Remove instances of -9999
            if (!line.contains("-9999")) {
                String stateFileName = null;
                Pattern p = Pattern.compile(".* ([A-Z][A-Z]) US");
                Matcher m = p.matcher(line);
                if (m.find()){
                    stateFileName = m.group(1);
                    stateFileName = "States/" + stateFileName + ".csv";
                    statefile = new File(stateFileName);
                    FileWriter stateWriter = new FileWriter(statefile, true);
                    stateWriter.write(line + "\n");
                    // Progress reporting
                    //System.out.printf("Writing [%s] to file [%s]\n", line,
                    // statefile);
                    stateWriter.flush();
                    stateWriter.close();
                }
            }
        }
        System.out.println("Elapsed" + (System.currentTimeMillis() - start)+ "ms");
        br.close();
        fr.close();
        in.close();
    }
}
【Answer】
Matching every line against a regular expression is slow, and the program handles one record at a time (it recompiles the pattern and reopens, flushes and closes the state file for each line) instead of processing records in batches. The task is simple and fast in SPL (Structured Process Language), where it takes just several seconds.
SPL script:
| | A |
|---|---|
| 1 | =file("data.csv").import@is() |
| 2 | =A1.group(mid(~,pos(~,"US'")-3,2):state;~:data) |
| 3 | =A2.run(file("d:\\temp\\"+state+".csv").export(data)) |
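For readers more at home in Java, the grouping key computed in A2 (the two-letter state code that sits just before ` US'` in each record) can be sketched without a regular expression. This is only an illustration: the class name `StateKey` and the method `stateOf` are hypothetical, and the sketch assumes every valid record ends its quoted station name with a space, the state code, another space and `US'`, as in the sample line from the question.

```java
public class StateKey {
    // Hypothetical helper: returns the two characters that precede " US'" in a record,
    // or null when the marker is missing or sits too close to the start of the line.
    // Assumption: the station field always ends with "<STATE> US'", as in the sample record.
    static String stateOf(String line) {
        int us = line.indexOf(" US'");
        if (us < 2) {
            return null;
        }
        return line.substring(us - 2, us);
    }

    public static void main(String[] args) {
        String sample =
                "COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553";
        System.out.println(stateOf(sample));
    }
}
```

Unlike the pattern in the question, this sketch does not check that the two characters are uppercase letters, so malformed records would need extra validation.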
esProc provides a JDBC interface that lets a third-party program call an SPL script the same way it would query a database and read back a result set. For details on the invocation, see How to Call an SPL Script in Java.
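If adding esProc to the project is not an option, the same idea (avoid per-record file opens and pattern compilation) can be approximated in plain Java. The following is a minimal sketch, not a tuned implementation: the class `StateSplitter` and its `split` method are hypothetical names, the input and output locations are taken from the question's code, and it assumes the number of distinct states is small enough to keep one open writer per state. Note that it overwrites existing state files instead of appending to them.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class StateSplitter {
    // Hypothetical helper: copies each usable record of `input` into
    // outDir/<STATE>.csv and returns the number of records written.
    static int split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        // One buffered writer per state, opened once and reused for every record.
        Map<String, BufferedWriter> writers = new HashMap<>();
        int written = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            reader.readLine(); // skip the first line, as the question's program does
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("-9999")) {
                    continue; // drop records with missing values
                }
                // Assumption: the station field ends with "<STATE> US'".
                int us = line.indexOf(" US'");
                if (us < 2) {
                    continue; // no recognizable state code
                }
                String state = line.substring(us - 2, us);
                BufferedWriter writer = writers.get(state);
                if (writer == null) {
                    // Default options create the file or truncate an existing one.
                    writer = Files.newBufferedWriter(
                            outDir.resolve(state + ".csv"), StandardCharsets.UTF_8);
                    writers.put(state, writer);
                }
                writer.write(line);
                writer.write('\n');
                written++;
            }
        } finally {
            // Closing flushes the buffers; this happens once per state, not once per record.
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        int n = split(Paths.get("climatedata.csv"), Paths.get("States"));
        System.out.println("Records written: " + n);
    }
}
```

Keeping the writers open means each state file is opened and closed once, instead of once per record as in the question's loop.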