Spark SQL Find the Number of Extensions for a Record
Question
Source: https://stackoverflow.com/questions/70470272/spark-sql-find-the-number-of-extensions-for-a-record
I have a dataset as follows:
| col1 | extension_col1 |
|------|----------------|
| 2345 | 2246 |
| 2246 | 2134 |
| 2134 | 2091 |
| 2091 | Null |
| 1234 | 1111 |
| 1111 | Null |
I need to find the number of extensions available for each record in col1. The records are already sorted, and each contiguous set (chain) is terminated by a Null.
The final result will be like this:
| col1 | extension_col1 | No_Of_Extensions |
|------|----------------|------------------|
| 2345 | 2246 | 3 |
| 2246 | 2134 | 2 |
| 2134 | 2091 | 1 |
| 2091 | Null | 0 |
| 1234 | 1111 | 1 |
| 1111 | Null | 0 |
Value 2345 extends as 2345 > 2246 > 2134 > 2091 > Null, so it has 3 extension relations, excluding the null.
How to get the 3rd column (No_Of_Extensions) using Spark SQL/Scala?
Answer
Since the table's rows are already ordered, the idea for this scenario is to start a new group each time the previous record's second column is null, and then number the rows within each group according to the requirement. Achieving this directly in SQL is a hassle: you first have to create row numbers and a marker column, and then group by the marker column and row numbers. A common alternative is to fetch the original data out of the database and process it in Python or SPL. SPL, an open-source Java package, is easy to integrate into a Java program and generates much simpler code; it expresses the algorithm in only two lines:
|   | A |
|---|---|
| 1 | =MYSQL.query("select * from t4") |
| 2 | =A1.group@i(#2[-1]==null).run(len=~.len(),~=~.derive(len-#:No_Of_Extensions)).conj() |
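For comparison, since the question asks for Spark SQL/Scala, here is a minimal DataFrame-API sketch of the same idea (not part of the original SPL answer): a running count of the Nulls seen in earlier rows assigns each row to its chain, and the number of rows remaining below a row within its chain is its No_Of_Extensions. The object name, the inlined sample data, and the use of monotonically_increasing_id to capture the incoming row order are assumptions; the approach relies on the rows actually arriving in the order shown.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ExtensionCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExtensionCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data in its given order; None marks the end of a chain.
    val df = Seq[(Int, Option[Int])](
      (2345, Some(2246)),
      (2246, Some(2134)),
      (2134, Some(2091)),
      (2091, None),
      (1234, Some(1111)),
      (1111, None)
    ).toDF("col1", "extension_col1")

    // Tag each row with an index that preserves the incoming order
    // (assumes the DataFrame's row order matches the source order).
    val indexed = df.withColumn("idx", monotonically_increasing_id())

    // A chain ends at a Null, so the count of Nulls among the rows
    // *before* the current one identifies the chain the row belongs to.
    val byIdx = Window.orderBy("idx")
    val grouped = indexed.withColumn(
      "grp",
      coalesce(
        sum(when($"extension_col1".isNull, 1).otherwise(0))
          .over(byIdx.rowsBetween(Window.unboundedPreceding, -1)),
        lit(0)))

    // No_Of_Extensions = how many rows remain below the current row
    // within its own chain (0 for the terminating Null row).
    val remaining = Window
      .partitionBy("grp")
      .orderBy("idx")
      .rowsBetween(1, Window.unboundedFollowing)

    grouped
      .withColumn("No_Of_Extensions", count(lit(1)).over(remaining))
      .drop("idx", "grp")
      .show()
  }
}
```

On the sample data this prints 3, 2, 1, 0, 1, 0 in the new column. Note that the un-partitioned window over idx pulls all rows into a single partition; that is hard to avoid here because the grouping depends on global row order.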
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL