Find Missing Items within a Fixed Interval Time Series Stored in a CSV File
Problem description & analysis
Below is the content of CSV file csv.csv:
2020-07-29 00:00:00
2020-07-29 01:00:00
2020-07-29 02:00:00
2020-07-29 03:00:00
2020-07-29 04:00:00
2020-07-29 05:00:00
2020-07-29 06:00:00
2020-07-29 07:00:00
2020-07-29 08:00:00
2020-07-29 10:00:00
2020-07-29 11:00:00
2020-07-29 12:00:00
2020-07-29 13:00:00
2020-07-29 14:00:00
2020-07-29 15:00:00
2020-07-29 16:00:00
2020-07-29 17:00:00
2020-07-29 18:00:00
2020-07-29 19:00:00
2020-07-29 20:00:00
2020-07-29 22:00:00
2020-07-29 23:00:00
2020-07-30 00:00:00
2020-07-30 01:00:00
2020-07-30 02:00:00
2020-07-30 03:00:00
2020-07-30 04:00:00
2020-07-30 05:00:00
2020-07-30 06:00:00
2020-07-30 07:00:00
2020-07-30 08:00:00
2020-07-30 09:00:00
2020-07-30 10:00:00
2020-07-30 11:00:00
2020-07-30 12:00:00
The file records ordered data from 2020-07-29 00:00:00 to 020-07-30 12:00:00 at an interval of one hour. There should have been 24 records for each day, but actually missing data exists. We are trying to find the missing datetime records, as shown below:
2020-07-29 09:00:00
2020-07-29 21:00:00
Solutions
Method 1: Through difference
We write the following script p1.dfx in esProc:
A |
|
1 |
=file("csv.csv").import@i() |
2 |
=periods@s(A1(1),A1.m(-1),3600) |
3 |
=A2\A1 |
Explanation:
A1 Import the datetime records from the CSV file.
A2 List all datetimes from the first to last at an interval of 3600 seconds (one hour).
A3 Calculate the difference between A2 and A1.
Method 2: By traversal
We write the following script p1.dfx in esProc:
A |
|
1 |
=file("csv.csv").import@i() |
2 |
=A1.conj(if(#>1 && interval@s(~[-1],~)>3600, periods@xs(elapse@s(~[-1],3600),~,3600))) |
Explanation:
A1 Import the datetime records from the CSV file.
A2 Loop through each record of A1, during which, from the second datetime, if the interval between the previous datetime and the current one is greater than 3600 seconds (one hour), get every datetime that between the datetime 3600 seconds (one hour) after the previous datetime and the current datetime at an interval of 3600 seconds (one hour). @x option used in periods() function enables excluding the right end point.
See How to Call an SPL Script in Java to learn about the integration of an SPL script with a Java program.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL