Spark SQL: Find the Number of Extensions for a Record
Question
Source: https://stackoverflow.com/questions/70470272/spark-sql-find-the-number-of-extensions-for-a-record
I have a dataset as follows:
| col1 | extension_col1 |
|------|----------------|
| 2345 | 2246 |
| 2246 | 2134 |
| 2134 | 2091 |
| 2091 | Null |
| 1234 | 1111 |
| 1111 | Null |
I need to find the number of extensions available for each record in col1. The records are already sorted and form contiguous chains, each terminated by a Null.
The final result will be like this:
| col1 | extension_col1 | No_Of_Extensions |
|------|----------------|------------------|
| 2345 | 2246 | 3 |
| 2246 | 2134 | 2 |
| 2134 | 2091 | 1 |
| 2091 | Null | 0 |
| 1234 | 1111 | 1 |
| 1111 | Null | 0 |
Value 2345 extends as 2345 > 2246 > 2134 > 2091 > Null, so it has 3 extension relations, excluding the Null.
How can I get the 3rd column (No_Of_Extensions) using Spark SQL/Scala?
Answer
Since the data table's first column is already ordered, the idea is to start a new group each time the previous record's second column is null, and then number the records within each group. Doing this directly in SQL is cumbersome: you first have to generate row numbers and a marker column, then group by the marker column and row numbers (a Spark/Scala sketch of this approach follows the SPL code below). A common alternative is to fetch the original data out of the database and process it in Python or SPL. SPL, an open-source Java package, is easy to integrate into a Java program and produces much simpler code. It expresses the algorithm in only two lines:
|   | A |
|---|---|
| 1 | =MYSQL.query("select * from t4") |
| 2 | =A1.group@i(#2[-1]==null).run(len=~.len(),~=~.derive(len-#:No_Of_Extensions)).conj() |
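In A2, group@i(#2[-1]==null) starts a new group whenever the previous record's second field is null; within each group, derive(len-#:No_Of_Extensions) adds a column equal to the group length minus the record's position in the group; conj() then concatenates the groups back into one sequence.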
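Since the question asks for Spark SQL/Scala, here is a minimal sketch of the window-based approach outlined above, under two assumptions: the sample data is embedded inline rather than read from a table, and a hypothetical id column is added to pin down the existing row order, since Spark window functions need an explicit ordering. The null-marker, running-sum grouping, and per-group countdown mirror the steps described in the answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ExtensionCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("extensions").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical `id` column capturing the existing row order;
    // the other columns follow the question's sample data.
    val df = Seq(
      (1, "2345", Some("2246")),
      (2, "2246", Some("2134")),
      (3, "2134", Some("2091")),
      (4, "2091", None),
      (5, "1234", Some("1111")),
      (6, "1111", None)
    ).toDF("id", "col1", "extension_col1")

    val byId = Window.orderBy("id")

    // Mark the start of a chain: the previous row's extension_col1 is null
    // (or there is no previous row); a running sum of markers gives a group id.
    val withGroup = df
      .withColumn("new_group",
        when(lag($"extension_col1", 1).over(byId).isNull, 1).otherwise(0))
      .withColumn("grp", sum($"new_group").over(byId))

    // Within each group, count the rows that follow the current one:
    // the terminating row (extension_col1 = Null) gets 0.
    val byGroup = Window.partitionBy($"grp").orderBy($"id")
    val result = withGroup
      .withColumn("No_Of_Extensions",
        count(lit(1)).over(byGroup.rowsBetween(1, Window.unboundedFollowing)))
      .select("col1", "extension_col1", "No_Of_Extensions")

    result.show()
  }
}
```

Note that the ungrouped window over id pulls all rows into a single partition, which is fine for a sketch like this but worth restructuring for large datasets.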
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL