Looking for the Best Scripting Tool for Computing Time Series Data
【Abstract】
A scripting tool for handling time series data must have a good support of order-based computations. This article compares three scripting tools in terms of development efficiency, syntax expression ability and the support of structured computation function library, in order to find out how well each performs in handling sequence-number-based computations, relative-position-based computations and order-based set-operations. esProc SPL is the most efficient one among them. Looking for the Best Scripting Tool for Computing Time Series Data!
The time series data refers to the business data ordered by time. The handling of it involves quarters, months, workdays and weekends, and sometimes, complicated order-based calculations. So you need to choose a scripting language that has enough computing ability. The commonly-used scripting languages include SQL, Python Pandas and esProc SPL. Now let’s examine their capabilities one by one.
SQL
As the long-lasting and most widely-used language, SQL has been already at its peak within its model frame. Almost every simple operation has their SQL solution, including the order-based calculations of course.
The calculation of link relative ration is one example. The table stock1001 stores the transaction information of a stock. Transaction date (transDate) and closing price (price) are the two important fields. The task is to calculate the growth rate of each transaction date compared with the previous one.
This is a relative-position-based calculation. The SQL query is quite simple when using the window function:
select transDate,price,
price/lag(price) over(order by transDate)-1 comp
from stock1001
But the code will be roundabout if the SQL product doesn’t support window functions:
With A as(SELECT t1.transDate,t1.price, COUNT(*) AS rk
FROM stock1001 AS t1,
stock1001 AS t2
WHERE t1.transDate >t2.transDate or (t1.transDate=t2.transDate and t1.price<=t2.price)
GROUP BY t1.transDate, t1.price
ORDER BY rk)
select t1.transDate,(t1.price/t2.price-1) comp from A as t1 left join A as t2
on t1.rk=t2.rk+1
There are two reasons that the code is so roundabout. First, SQL is based on unordered sets and doesn’t have sequence numbers. It’s inconvenient to handle order-based calculations directly. The language has to invent a field of sequence numbers for such a calculation through self-join and grouping and aggregate operations. Second, SQL doesn’t have relative sequence numbers. In a roundabout way, it associates the previous row to the current row to achieve the relative-position calculation. Both make the code complicated.
Window functions improve the situation in some degree. When the computing goal isn’t simple, the code is still roundabout.
Here’s an example of calculating median (If the number of records (L) in a table is an even number, the median is the average of the two values in the middle; their sequence numbers are L/2 and L/2+1 respectively. If L is an odd number, the median is the value in the middle, whose sequence number is (L+1)/2). scores table stores student scores and has two main fields – student ID (studentid) and math scores (math). The task is to calculate the median of math scores.
The SQL query:
With A as (select studentdid,math, row_number() over (order by math) rk
from scores),
B as (select count(1) L from scores)
select avg(math) from A where rk in (
select case when mod(L,2)=1 then ((L+1) div 2) else ( L div 2) end no from B
union
select case when mod(L,2)=1 then ((L+1) div 2) else (L div 2)+1 end from B
))
The use of window function doesn’t make the SQL query simplified. Generating sequence numbers is redundant for the order-based set operation but it is necessary for the SQL solution. When explicit sequence numbers are required as in this example, the code is even more complex. It’s awkward to perform branch judgement in SQL, too. So the language uses a small trick to try to produce a simpler query, but this only make it more difficult to understand. In handling the case when L is an odd, it calculates the average of the two same middle values instead of returning the only middle value.
Using trunc function can skillfully avoid the judgement to simplify the code while calculating the median. But as the trick is far from the definition of the median, it may lead to great difficulty in understanding the code.
Getting the number of consecutively rising days is a more complicated case. Database table AAPL stores price information of a stock. The main fields are transaction date (transDate) and closing price (price). The task is to count the longest consecutively rising transaction days.
The SQL query:
select max(continue_inc_days)
from (select count(*) continue_inc_days
from (select sum(inc_de_flag) over(order by transDate) continue_de_days
from (select transDate,
case when
price>LAG(price) over(order by transDate)
then 0 else 1 end inc_de_flag
from AAPL) )
group by continue_de_days)
According to the intuitive way of thinking, we loop through the stock records ordered by date. In the process we add 1 to the number of consecutively rising days (the initial value is 0) if the current closing price is higher than the previous one; and if the current price is lower, compare the current total number of the consecutively rising days with the existing largest number (initial value is 0), get the new largest number and reset the current number as 0. Repeat the operations until all records are looped over. Then the current number is what we want.
SQL can’t implement the natural algorithm because of the lack of support for order-based calculations. So it resorts to an unusually trick. The trick divides the stock records ordered by date into groups, where records with consecutively rising prices are put into one group, that is the current record will be put into the same group with the previous one if its closing price rises, and those with consecutively decreasing prices are put into a new group. Then it counts the records in each group and finds the largest count, which is the desired result.
As SQL is already so hard to handle the slightly complicated scenarios, you can imagine its inability in dealing with more complicated cases. SQL’s awkwardness is rooted in unordered sets, its theoretical foundation. The intrinsic weak can’t be compensated by any updates or tricks.
Python Pandas
Pandas is Python’s function library for processing structured data. It is one of the common scripting tools to manipulate time series data.
As a library intended specifically for structured data computations, Pandas handles order-based calculations effortlessly. To calculate link relative ratio, for instance, Pandas query is as follows:
import pandas as pd
stock1001=pd.read_csv('d:/stock1001.csv') #return as a DataFrame
stock1001 ['comp'] = stock1001.math/ stock1001.shift(1).math-1
The first two lines import data from the file and the key code occupies only one line. Pandas, however, can’t express the concept of the previous row to implement a relative-position-based calculation directly. It uses shift(1) function to shift the column value one row down to implement the calculation in a roundabout way. for a Pandas beginner it’s easy to confuse with the row value and column value, the previous row and the next row.
As a young programming language, Pandas has more advanced methods in handling order-based calculations than SQL. That’s because Pandas is constructed on ordered sets. Its dataFrame data type has intrinsic sequence numbers that are suitable for handling order-based calculations. It’s easy to do the above tasks in Pandas.
To calculate median, Pandas has the following core code:
…
df=pd.read_csv('d:/scores.csv') #return as a DataFrame
math=df['math']
L=len(df)
if L % 2 == 1:
result= math[int(L / 2)]
else:
result= (math[int(L / 2 - 1)] + math[int(L / 2)]) / 2
print(result)
In the code Pandas uses [N] to represent a sequence number rather than inventing one specifically. That makes code simpler. Pandas is a procedural language. It does branch judgement in an easy way and doesn’t need a weird trick to simplify the code.
It’s also simple to calculate the largest number of rising days. Below is the Pandas key code:
…
aapl = pd.read_sql_query("select price from AAPL order by transDate", conn)
continue_inc_days=0 ; max_continue_inc_days=0
for i in aapl['price'].shift(0)>aapl['price'].shift(1):
continue_inc_days =0 if i==False else continue_inc_days +1
max_continue_inc_days = continue_inc_days if max_continue_inc_days < continue_inc_days else max_continue_inc_days
print(max_continue_inc_days)
conn.close()
Pandas has much stronger syntax expression ability. It is able to implement the natural algorithm without the need of showing off tricks.
Pandas, however, also can’t express a relative position directly. It shifts the column value one row down to represent the “previous row” in a hard to understand way.
Though more convenient to use than SQL, Pandas is still difficult to use in certain complex scenarios.
Finding the accumulated sales according to the specified condition is an example. sales table stores sales amounts of each client. The main fields include client (client) and sales amount (amount). The task is to find the first n big clients whose sales amount reaches half of the total and list them by amount in descending order.
The Pandas query:
import pandas as pd
sale_info = pd.read_csv("d:/sales.csv")
sale_info.sort_values(by=‘Amount’,inplace=True,ascending=False)
half_amount = sale_info[‘Amount’].sum()/2
vip_list = []
amount = 0
for client_info in sale_info.itertuples():
amount += getattr(client_info, ‘Amount’)
if amount < half_amount:
vip_list.append(getattr(client_info, ‘Client’))
else:
vip_list.append(getattr(client_info, ‘Client’))
break
print(vip_list)
Another scenario is to calculate the growth rate of each of the three day with the highest prices. stock1001 stores daily prices of a stock. Its main fields are transaction date (transDate) and closing price (price). The task is to list the three days with the highest prices in descending order and calculate the growth rate for each day.
The Pandas query:
import pandas as pd
stock1001 = pd.read_csv("d:/stock1001_price.txt",sep = ‘\t’)
CL = stock1001[‘CL’]
CL_psort = CL.argsort()[::-1].iloc[:3].values
CL_psort_shift1 = CL_psort-1
CL_rise = CL[CL_psort].values/CL[CL_psort_shift1].values-1
max_3 = stock1001.loc[CL_psort].reset_index(drop = True)
max_3[‘RISE’] = CL_rise
print(max_3)
In these more complicated scenarios, Pandas also turns to some difficult tricks. It makes the code hard to write and understand.
esProc SPL
eaProc boasts a wealth of functions for processing structured data. Besides its ordered-set base and the support of sequence numbers, the language offers convenient neighbor reference mechanism and abundant position-based functions to implement order-based calculations fast and efficiently.
SPL handles simple order-based calculations effortlessly. Here’s the SPL query for calculating link relative ratio:
A |
B |
|
1 |
=file("d:/stock1001.csv").import@tc() |
/ Import the CSV file |
2 |
=A1.derive(price/price[-1]-1:comp) |
/ Calculate the link relative ratio using the relative position reference mechanism |
A1 retrieves data from the CSV file. A2 does the key work. SPL uses the intuitive symbol [-1] to represent the previous row relative to the current one. This is a feature that both SQL and Pandas haven’t, and a reflection of SPL’s computing ability.
To calculate the median, SPL has the following key code:
A |
|
1 |
… |
2 |
=L=A1.len() |
3 |
=if(A2%2==0,A1([L/2,L/2+1]).avg(math),A1((L+1)/2).math) |
SPL uses [N] to represent a sequence number instead of inventing one. It is a procedural language. So it can use both if/else statement to implement a complicated branch and the if function, as in this case, to achieve a concise judgment.
It’s easier to calculate the longest consecutively rising days in SPL than in SQL/Pandas. Below is the SPL key code:
A |
|
1 |
… |
2 |
=a=0,A1.max(a=if(price>price[-1],a+1,0)) |
SPL implements the intuitive algorithm in an efficient way. With SPL, you can write a block of loop code using the loop statement, or use the loop function max to express the more concise loop aggregation as in this case.
SPL specializes in structured data processing. It can cope with even more complicated order-based calculations effortlessly.
To get certain running totals, for example, SPL has the following simple and short script:
A |
B |
|
1 |
=demo.query(“select client,amount from sales”).sort(amount:-1) |
Retrieve data and order it in descending order |
2 |
=A1.cumulate(amount) |
Get the sequence of running totals |
3 |
=A2.m(-1)/2 |
The final cumulative sum is the total |
4 |
=A2.pselect(~>=A3) |
Get the position where the cumulative value is half the total |
5 |
=A1(to(A4)) |
In an intuitive way, the SPL code calculates the cumulative sums for each client in A2, gets half of the total in A3, finds the position where the cumulative value is greater than A3, and gets the desired at the A4’s position. There are two SPL features in the script. One is the m()function used in A3. The function can get a value in a reversed order. -1 represents the last record. The other is pselect() function used in A4. The function returns a sequence number according to the specified condition. Both functions effectively simplify the code.
To calculate the growth rate for each of the three day with the highest prices, SPL uses the following script:
A |
B |
|
1 |
=file("d:/stock1001.csv").import@tc() |
/Retrieve data |
2 |
=A1.ptop(-3,price) |
/Get the positions of records containing the three days with the highest prices |
3 |
=A1.calc(A2,price/price[-1]-1) |
/Calculate the growth rates |
4 |
=A1(A2).new(transDate,price,A3(#):comp) |
/Create a two-dimensional table consisting of 3 fields |
The ptop()function in A2 finds the positions of the first N records. Similar to the pselect() function, it returns a sequence of sequence numbers instead of a sequence of records. There are many similar SPL functions that can be used to simplify the order-based calculations. The number sign # is another SPL feature. It represents the sequence number of a field value. It’s convenient because you don’t need to invent sequence numbers specifically as SQL requires or define an index as Pandas does.
SPL is a specialized language for processing structured data and has a great wealth of relative functions. These better enable it to handle the order-based calculations, even the very complicated scenarios, efficiently and effortlessly. In a nutshell, esProc SPL is the best choice in handling time series data.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version