Comparison of SQL & SPL: Recursion Operation

 

An recursion operation is defined as the process of directly or indirectly calling a procedure repeatedly. The Tower of Hanoi puzzle is one typical example of this. This essay introduces the methods and basic principles of SQL and SPL, the two common programming languages, in handling recursion scenarios, and finds the efficient and fast solutions for you through sample programs. Looking Comparison of SQL & SPL: Recursion Operation for details.

Ⅰ. Search for all references recursively

Example 1According to a company’s organizational structure below, get the hierarchical level of each department (headquarters are level 1, branches are level 2, and so on).

ID

ORG_NAME

PARENT_ID

1

Head Office

0

2

Beijing Branch Office

1

3

Shanghai Branch Office

1

4

Chengdu Branch Office

1

5

Beijing R&D Center

2

 

SQL solution:

To get this done, we need to loop through each record to find all superiors for each organization recursively. According to the natural way of thinking, the expected recursive process is that, for each record of each organization, we can locate the parent record through its superior’s ID and then the grandparent record through the corresponding ID, and so on. SQL does not have the concepts of explicit record and reference, so we can only get records having the superior-subordinate relationship recursively and sort them in a specific order, as shown below:

 

SQL offers WITH statement to do the recursion query, as shown below:

 

WITH CTE1 (ID,ORG_NAME,PARENT_ID,O_ID,O_ORG_NAME) AS(

   SELECT

      ID,ORG_NAME,PARENT_ID,ID O_ID,ORG_NAME O_ORG_NAME

   FROM ORGANIZATION

   UNION ALL

   SELECT

      ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID,CTE1.O_ID,CTE1.O_ORG_NAME

   FROM ORGANIZATION ORG

   INNER JOIN CTE1

   ON ORG.ID=CTE1.PARENT_ID

)

SELECT

   O_ID ID,MIN(O_ORG_NAME) ORG_NAME,COUNT(*) "LEVEL"

FROM (

   SELECT

      ID,ORG_NAME,PARENT_ID,O_ID,O_ORG_NAME

   FROM(

      SELECT *

      FROM CTE1

      ORDER BY O_ID,ID

   )

)

GROUP BY O_ID

ORDER BY ID

 

With Oracle, you can also use START WITH … CONNECT BY … PRIOR … to do the recursive query, as shown below:

 

SELECT

   MAX(ID) ID, MAX(ORG_NAME) ORG_NAME, COUNT(*) "LEVEL"

FROM (

   SELECT

      ID,ORG_NAME,SUM(ID) OVER (ORDER BY RN) GROUP_ID

   FROM (

      SELECT

         CASE WHEN LAG(PARENT_ID,1) OVER(ORDER BY RN ASC)=0

         OR LAG(PARENT_ID,1) OVER(ORDER BY RN ASC) IS NULL THEN ID

         ELSE NULL END ID,

         CASE WHEN LAG(PARENT_ID,1) OVER(ORDER BY RN ASC)=0

         OR LAG(PARENT_ID,1) OVER(ORDER BY RN ASC) IS NULL THEN ORG_NAME

         ELSE NULL END ORG_NAME,

         RN

      FROM (

         SELECT

            ID,ORG_NAME,PARENT_ID,ROWNUM RN

         FROM ORGANIZATION ORG

         START WITH ORG.PARENT_ID>=0

         CONNECT BY ORG.ID=PRIOR ORG.PARENT_ID

      )

   )

)

GROUP BY GROUP_ID

ORDER BY GROUP_ID

 

The START WITH …  statement does not generate simpler code. Though the program returns records of all superiors in turn for every organization to the result set, we need to number records in each group and compare each record with its neighboring one to get the initial position of a group because a SQL set is unordered. As SQL GROUP BY only supports the equi-grouping, we need to add group marks for a further grouping. It would be much simpler if the language supported conditional grouping (according to PARENT_ID=0, for instance) as SPL does.

 

With databases that neither support WITH nor START WITH …, SQL is unable to achieve recursive queries directly but it has to use the stored procedure to write an recursion function (The details will be skipped here).

 

SPL solution:

SPL provides A.prior(F) function to perform a recursive query to find all references by default.


A

1

=T("Organization.txt")

2

>A1.switch(PARENT_ID,A1:ID)

3

=A1.new(ID,ORG_NAME,~.prior(PARENT_ID).len():LEVEL)

A1: Import Organization table from the source file.

A2: Objectify the ID foreign key of the superior organization to convert it into the corresponding record and achieve self-join.

A3: Create a new table made up of ID, organization name and level. A.prior() function calculates the department level by getting the number of levels of records that a record references.

 

The SPL script of handling an recursion problem is concise because SPL’s recursive logic is natural. A.prior() function returns a set of all records of the parent organization for the current organization. Then it’s easy to perform count or any other operation, even a complicated one.

 

SPL supports retrieving a data table from the database, too. We can change A1 in the above script as follows:


A

1

=connect("db").query("SELECT * FROM ORGANIZATION")

 

Ⅱ. Search for all references recursively until a specific value appears

Example 2According to a company’s organizational structure below, get Beijing Branch Office’s subordinate organizations and its superiors’ names (separate multilevel organizations by comma).

ID

ORG_NAME

PARENT_ID

1

Head Office

0

2

Beijing Branch Office

1

3

Shanghai Branch Office

1

4

Chengdu Branch Office

1

5

Beijing R&D Center

2

 

SQL solution:

After the analysis of example 1, the first idea for doing this is that, during the recursive process for searching for the superiors for each organization, stop the recursion if the specified value (the Beijing Branch Office) is found, keep its records and filter away records of the organizations that cannot be found. It is difficult for SQL to hit the target with one round of recursion. So we divide the task into two steps. First, find all subordinate organizations of Beijing Branch Office; second, as the solution to the above problem does, find all superior organizations for each of those organizations until Beijing Branch Office appears. Below is the SQL queries:

 

WITH CTE1 (ID,ORG_NAME,PARENT_ID) AS(

   SELECT

      ID,ORG_NAME,PARENT_ID

   FROM ORGANIZATION

   WHERE ORG_NAME='Beijing Branch Office'

   UNION ALL

   SELECT

      ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID

   FROM ORGANIZATION ORG

   INNER JOIN CTE1

   ON ORG.PARENT_ID=CTE1.ID

)

,CTE2 (ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM) AS(

   SELECT

      ID,ORG_NAME,PARENT_ID,ID O_ID,1 GROUP_NUM

   FROM CTE1

   UNION ALL

   SELECT ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID,

      CTE2.O_ID,CTE2.GROUP_NUM+1 GROUP_NUM

   FROM ORGANIZATION ORG

   INNER JOIN CTE2

   ON ORG.ID=CTE2.PARENT_ID AND

   CTE2.ORG_NAME<>'Beijing Branch Office'

)

SELECT

   MAX(O_ID) ID, MAX(O_ORG_NAME) ORG_NAME,

   MAX(PARENT_NAME) PARENT_NAME

FROM(

   SELECT

      O_ID, O_ORG_NAME,

      WM_CONCAT(ORG_NAME) OVER

      (PARTITION BY O_ID ORDER BY O_ID,GROUP_NUM) PARENT_NAME

   FROM(

      SELECT

         ID,PARENT_ID,O_ID,GROUP_NUM,

         CASE WHEN GROUP_NUM=1 THEN NULL ELSE ORG_NAME END ORG_NAME,

         CASE WHEN GROUP_NUM=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME

      FROM (

         SELECT

            ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM,ROWNUM RN

         FROM CTE2

         ORDER BY O_ID,GROUP_NUM

      )

   )

)

GROUP BY O_ID

ORDER BY O_ID

 

With Oracle, you can also use START WITH … CONNECT BY … PRIOR … to do the recursive query, as shown below:

 

WITH CTE1 AS (

   SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN

   FROM ORGANIZATION ORG

   START WITH ORG.ORG_NAME='Beijing Branch Office'

   CONNECT BY ORG.PARENT_ID=PRIOR ORG.ID

)

,CTE2 AS (

   SELECT

      ID,ORG_NAME,PARENT_ID,RN,

      CASE WHEN LAG(ORG_NAME,1) OVER(ORDER BY RN ASC)= 'Beijing Branch Office' OR

      LAG(ORG_NAME,1) OVER(ORDER BY RN ASC) IS NULL THEN 1 ELSE 0 END FLAG

   FROM(

      SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN

      FROM CTE1

      START WITH 1=1

      CONNECT BY CTE1.ID=PRIOR CTE1.PARENT_ID

   )

)

SELECT

   MAX(ID) ID, MAX(O_ORG_NAME) ORG_NAME,

   MAX(PARENT_NAME) PARENT_NAME

FROM(

   SELECT

      ID,O_ORG_NAME,GROUP_ID,

      WM_CONCAT(ORG_NAME) OVER

      (PARTITION BY GROUP_ID ORDER BY RN) PARENT_NAME

   FROM(

      SELECT

         ID, ORG_NAME, O_ORG_NAME,RN,

         SUM(ID) OVER (ORDER BY RN) GROUP_ID

      FROM(

         SELECT

            PARENT_ID,RN,

            CASE WHEN FLAG=1 THEN NULL ELSE ORG_NAME END ORG_NAME,

            CASE WHEN FLAG=1 THEN ID ELSE NULL END ID,

            CASE WHEN FLAG=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME

         FROM CTE2

      )

   )

)

GROUP BY GROUP_ID

ORDER BY GROUP_ID

 

SPL solution:

SPL offers A.prior(F,r) function to do the recursive query to find the refences until a specific value appears:


A

1

=T("Organization.txt")

2

>A1.switch(PARENT_ID,A1:ID)

3

=A1.select@1(ORG_NAME=="Beijing Branch Office")

4

=A1.new(ID,ORG_NAME,~.prior(PARENT_ID,A3):PARENT_NAME)

5

=A4.select(PARENT_NAME!=null)

6

=A5.run(PARENT_NAME=PARENT_NAME.(PARENT_ID.ORG_NAME).concat@c())

A1: Import the Organization table.

A2: Objectify the ID foreign key of the superior organization to convert it into the corresponding record and achieve foreign key objectification.

A3: Get records of Beijing Branch Office.

A4: Create a new table made up of ID, organization name and the set of records of all superior organizations.

A5: Get records that has at least one superior organization, that is, those of Beijing Branch Office.

A6: Join up names of superior organizations into a comma-separated string in a circular way.

 

The SPL solution is logically clear and concise. It uses records of Beijing Branch Office as parameters to do the recursive search of references.

 

Ⅲ. Search for leaf records recursively

Example 3According to the following Chinese administrative divisions table, find regions and counties under Hebei province. Below is part of the source table:

ID

NAME

PARENT_ID

1

China

0

11

Beijing

1

12

Tianjin

1

13

Hebei

1

1301

Shijiazhuang

13

1302

Tangshan

13

 

SQL solution:

 

First, we find all records of Hebei province and then find records that are not superior regions of other regions. Below are the SQL queries:

 

WITH CTE1 (ID,NAME,PARENT_ID,PARENT_NAME) AS(

   SELECT

      ID,NAME,PARENT_ID,NULL PARENT_NAME

   FROM CHINA_REGION

   WHERE NAME='Hebei'

   UNION ALL

   SELECT

      T.ID,T.NAME,T.PARENT_ID,CTE1.NAME PARENT_NAME

   FROM CHINA_REGION T

   INNER JOIN CTE1

   ON T.PARENT_ID=CTE1.ID

)

SELECT ID,NAME,PARENT_NAME

FROM CTE1 T1

WHERE NOT EXISTS(

   SELECT 1

   FROM CTE1 T2

   WHERE T1.ID=T2.PARENT_ID

)

ORDER BY ID

 

SPL solution:

SPL has P.nodes@d(F,r) function to find all leaves recursively:


A

1

=T("ChinaRegion.csv")

2

>A1.switch(PARENT_ID,A1:ID)

3

=A1.select@1(NAME=="Hebei")

4

=A1.nodes@d(PARENT_ID,A3)

5

=A4.new(ID,NAME,PARENT_ID.NAME:PARENT_NAME)

A1: Import ChinaRegion table.

A2: Objectify the ID foreign key of the superior region to convert it into the corresponding record and achieve foreign key objectification.

A3: Get records of Hebei province.

A4: Use P.nodes() function to perform the recursive search; @d option is used to recursively find all leaves.

A5: Create a new table made up of ID, region names and names of superior regions.

 

IV. Traverse all files under a directory

Example 4According to the following terminal device use for online teaching in a primary school, find the proportion of each type of terminal. Below is the directory structure:

Files under the directory is of Excel format. Below is part of the source data:

ID

STUDENT_NAME

TERMINAL

1

Rebecca Moore

Phone

2

Ashley Wilson

Phone,PC,Pad

3

Rachel Johnson

Phone,PC,Pad

4

Emily Smith

PC,Pad

5

Ashley Smith

PC

6

Matthew Johnson

Phone

7

Alexis Smith

Phone,PC

8

Megan Wilson

Phone,PC,Pad

 

Since SQL cannot retrieve a directory and the files under it, a SQL solution is thus impossible. Let’s look at how SPL handles the recursive traversal of directories and files.

SPL solution:

SPL offers directory@s() function to search for names of all files under all subdirectories in an recursive way:


A

1

=directory@ps("C:/Primary School")

2

>A1.run(t=T(~),@+=t.len(),~=t.conj(TERMINAL.split@c()))

3

=A1.conj().groups(~:TERMINAL;count(~)/A2:PERCENTAGE)

A1Use directory() function to import the file directory. @s option is used to get file names recursively under all subdirectories.

A2Read in files circularly, split each string of terminals by comma, and concatenate all unique terminals.

A3Group records by terminal and calculate the proportion for each.

 

V. Concatenate field values in a JSON file recursively

Example 5Below is the JSON data of worldwide new COVID-19 confirmed cases in a certain time point. The task is to get the total confirmed cases across the world. Below is part of the source data:

 

[

   {Region:"USA",Confirmed:[

      {Region:"California",Confirmed:3671902},

      {Region:"New York",Confirmed:1139740},

      {Region:"Virginia",Confirmed:620801},

      {Region:"Florida",Confirmed:2064525},

      …

   ]},

   {Region:"Brazil",Confirmed:[…]},

   {Region:"India",Confirmed: […]},

   {Region:"France",Confirmed: […]}

   …

]

 

As SQL cannot retrieve a JSON file, we will skip the SQL solution. Below is the SPL way of handling the recursive traversal of a JSON file.

 

SPL solution:

SPL offers A.field@r()function to get field values recursively, and A.conj@r() function to concatenate members of a sequence recursively. Below is the SPL script:


A

1

=json(file("COVID-19.json").read())

2

=A1.field@r("Confirmed")

3

=A2.conj@r().sum()

A1: Read in the JSON file.

A2: GET ALL Confirmed field values recursively.

A3: Concatenate all numbers of confirmed cases and sum them.

 

Summary

SQL has its method of achieving recursion problems, but the method not convenient to use. This is because SQL does not have the concepts of explicit records and reference, and row numbers should be used in the result of an recursive operation to mark the superior-subordinate relationship. Besides, SQL is based on unordered sets, which makes it troublesome to perform subsequent operations after recursions.

SPL supports the data structure of different data types and offers the concept of explicit records. More importantly, SPL’s logic in handle recursion problems is clear. It uses an recursion function after the self-join and then a set operation to finish the job. SPL supplies flexible and adaptable recursion features to handle common recursion scenarios, such as directory recursion and file content recursion.

 With a complex query, the complexity of SQL solution increases sharply by resorting to temporary table and nested queries. This makes it hard to write and maintain the SQL statements. SPL, however, organizes an algorithm in a natural way of thinking and generates concise code.

The SPL-driven and ordered-set-based esProc is the professional data computation engine that provides a complete set of set-oriented operations. The tool combines merits of both Java and SQL and makes it easy to handle recursion scenarios.

Organization.txt

ChinaRegion.csv