Data Analysis Programming from SQL to SPL: Airport Analysis

 

Data structure and sample data:

..

The table stores data for flights departing from and arriving at city airports on January 1, 2025.

In this table, FId is the flight number, AircraftId is the aircraft number, DepCity is the departure city, ArrCity is the arrival city, DepTime is the departure time, and ArrTime is the arrival time.

At 0:00 on that day,some aircraft were already parked in these cities. For example, the first record in the table indicates that the aircraft with AircraftId 1 was already in Washington at the start of the day. In this case, FId, DepCity, and DepTime are set to null, and ArrTime is set to 1970/1/1 0:0:0.

As per business rules, each flight number appears at most once within a single day.

1.Determine the maximum number of aircraft that stay in New York simultaneously

Throughout the day, some aircraft arrive in New York while others depart, causing the total number of aircraft staying in the city to constantly change. Now we need to calculate the maximum number of aircraft that stay in New York simultaneously.

Sort the flight data for arrivals and departures in New York by time, using the arrival time for arrivals and the departure time for departures. Then, mark arrival records as 1 and departure records as -1. Accumulating these marks from the beginning will obtain the count of aircraft present in New York corresponding to each record, as shown in the table below:

..

tm is the departure/arrival time, and chg is the arrival/departure flag. Loop through this table, summing the current row’s chg and the previous row’s cum to obtain the current row’s cumulative value cum, which is the number of aircraft currently staying in New York.

After the loop, taking the maximum cum will give the desired result.

SQL can use the window function sum to achieve cumulative summation.

with t1 as (
        select FId, DepTime tm, -1 chg
        from flights
        where DepCity='New York'
        union all
        select FId, ArrTime tm, 1 chg
        from flights
        where ArrCity=' New York '
),
t2 as (
        select *, sum(chg) over(order by tm,chg) cum
        from t1
)
select max(cum)
from t2;

SPL also offers a similar cumulative function to find the maximum value:


A

1

=T("flight.csv")

2

=A1.select(DepCity=="New York").derive(DepTime:tm,-1:chg)

3

=A1.select(ArrCity=="New York").derive(ArrTime:tm,1:chg)

4

=(A2|A3).sort(tm,chg).derive(cum(chg):cum).max(cum)

A2, A3: Calculate the data sets for flights departing from and arriving in New York respectively, and compute the tm and chg columns. Here, we use cells to store the two data sets, eliminating the need to define temporary variables.

SPL IDE is highly interactive, allowing for step-by-step execution, and intuitively viewing the result of each step in the right-hand panel at any time.

..

A4: Merge A2 and A3, sort the merged data, and then compute the cumulative value cum of chg.

The derive is a loop function that directly loops on the set, eliminating the need to write multi-line statement blocks like for...next. By using the iterative cumulative function cum in a loop, you can implement the approach described above and compute the cum column.

Finally, find the maximum value of the cum column.

The generation of the cum column is unnecessary if only its maximum value is required:


A

1

=T("flight.csv")

2

=A1.select(DepCity=="New York").new(DepTime:tm,-1:chg)

3

=A1.select(ArrCity=="New York").new(ArrTime:tm,1:chg)

4

=(A2|A3).sort(tm,chg).max(cum(chg))

2.Find the latest arriving flights when the number of aircraft in New York is at its peak

Assuming the maximum number of aircraft stayed in New York simultaneously is n, we want to find the data of the last n flights that arrived in New York before the latest time when there were n aircraft.

Using the method from Task 1 to compute the cum column, find the record that corresponds to the last maximum value of cum.

..

The FId of this record is 1048 and the cum is 41 (which is n). Sort the flight data arriving in New York by tm and then find the position of 1048 within the sorted data.

..

Taking n=41 records upward from this position is the expected result.

SQL’s aggregate function max can only return the maximum value and not the record containing the maximum value. To find the record containing the last maximum value of cum, you need to sort the data in descending order by cum and tm, and then take the first record.

SQL also does not directly support position-based record referencing. When retrieving 41 records preceding the record 1048, you can only use window functions to artificially create a sequence number, and then use that sequence number in an outer query to find the target records.

with t1 as (
    select DepTime tm, -1 chg
    from flights
    where DepCity='New York'
    union all
    select ArrTime tm, 1 chg
    from flights
    where ArrCity='New York'
),
t2 as (
    select tm, sum(chg) over(order by tm,chg) cum
    from t1
),
t3 as ( 
    select *
    from t2
    order by cum desc, tm desc
    limit 1
),
t4 as (
    select f.*,row_number() over(order by ArrTime desc) rn
    from flights f, t3
    where f.ArrCity='New York' and f.ArrTime<=t3.tm
)
select t4.*
from t4,t3
where rn<=t3.cum
order by ArrTime;

order by ArrTime;

The subqueries have increased from two in Task 1 to four, and because of the addition of window functions, the statement has become much more complex.

SPL supports position calculations, and its unique maxp function can return the record containing the maximum value, making it easy to implement the above approach.


A

1

=T("flight.csv")

2

=A1.select(DepCity=="New York").derive(DepTime:tm,-1:chg)

3

=A1.select(ArrCity=="New York").derive(ArrTime:tm,1:chg)

4

=(A2|A3).sort(tm,chg).derive(cum(chg):cum).maxp@z(cum)

5

=A3.sort(tm)

6

=A5.pselect(FId==A4.FId)

7

=A5.to(A6-A4.cum+1,A6)

A4: Replace the max from Task 1 with maxp@z to get the record containing the maximum value instead of just the maximum value itself. The @z option of maxp indicates that it finds the first member from the end, which in this case refers to the latest one in time.

A5: Sort the data of A3 by tm. A6: Find the position of FId value 1048 within A5. A7: Retrieve n records starting from that position and moving backwards.

3.Find the flights that arrived latest and stayed in New York when the number of aircraft in the city is at its peak

Assuming the maximum number of aircraft stayed in New York simultaneously is n, we want to find the data of the latest flight that arrived in New York and did not depart before the last time when there were n aircraft.

First, find the position of the last maximum value of cum, then create a new dataset containing all the records preceding that position.

..

Reverse the order of the records in the new dataset to get the following table:

..

Group this table by AircraftId. For example, for the aircraft with AircraftId of 24, the grouped subset is:

..

The times within each group of data are still sorted in descending order. The aircraft flew from New York to Washington and then back to New York, and did not leave New York before the last maximum value of cum appeared. It can be seen that if the chg of the first record in the group is 1, then this is a flight record that arrived in New York and did not depart.

Next, retrieve only the first record in each group and filter by the condition chg=1 to get the desired result.

SQL cannot directly retrieve the first record in a group. As a workaround, you need to use window functions to artificially create a sequence number within the group, and then use that sequence number in an outer query to find the target records.

with t1 as (
    select *, DepTime tm, -1 chg
    from plane
     where DepCity='New York'
     union all
     select *, ArrTime tm, 1 chg
     from plane
     where ArrCity='New York'
 ),
 t2 as (
     select *, sum(chg) over(order by tm,chg) cum
     from t1
 ),
 t3 as ( 
     select *
     from t2
     order by cum desc, tm desc
     limit 1
 ),
 t4 as (
     select t1.*,
         row_number() OVER(PARTITION BY AircraftId ORDER BY tm desc) AS rno
     from t1,t3
     where t1.tm<=t3.tm
 )
select * from t4 where chg=1 and rno=1 

SPL supports directly retrieving the first record in a group, making it easy to implement this approach.


A

B

1

=T("flight.csv")

2

=A1.select(DepCity=="New York").derive(DepTime:tm,-1:chg).sort(tm)

3

=A1.select(ArrCity=="New York").derive(ArrTime:tm,1:chg).sort(tm)

4

=[A2,A3].merge(tm,chg)

=A4.to(A4.pmax@z(cum(chg))).rvs()

5

=B4.group@1(AircraftId).select(chg==1)

A4: Ordered merge of A2 and A3. The calculation results can be seen in the preview panel on the left side of the IDE.

..

B4: Following the above approach, find the position of the last maximum value of cum, then reverse the set of all records before this position. Here, the pmax function is used instead of maxp from Task 2, to find the position index of the record containing the maximum value, rather than the record itself.

A5: The group@1 means to take the first record from each group. Finally, filtering these records with the condition chg equals 1 will produce the desired result.