"It is a common operation to extract a subsequence from a sequence (that is, to extract a subset .."

mars RaqForum 25 No.
441 View • 3 Years ago

SPL Programming - 5.5 [Sequence as a whole] Positioning and selection

It is a common operation to extract a subsequence from a sequence (that is, to extract a subset from a set). We have learned to use the to() function and sequence of numbers to extract a subsequence according to the position of members.

Sometimes we want to get members from the end of a sequence in the reverse order. Of course, we can use the length of the sequence to calculate the sequence number, but it’s a bit troublesome. SPL provides functions that can use the reverse order.

The function A.m(i) will get the i-th member of A. if i<0, it will get the -i member counting from the end, that is

A.m(i)=A(if(i>0,i,i+A.len()+1))

Similarly, when p is a sequence of length k with negative members, there is a function

A.m(p)=[A.m(p(1)),…,A.m(p(k))]

A.m(-1) is often used to get the last member of A. Pascal triangle can be simplified a little bit:

	A	B
1	5	=[[1],[1,1]]
2	for A1-1	=B1.m(-1).(~+~[-1])
3		>B1\|=[B2\|1]

We can also select members by a certain step size:

A.step(k,i)=[A(i),A(i+k),A(i+2*k),…]

Starting from the i-th member, select one every k.

A.step(2,1) will select the odd numbered members, and A.step(2,2) will select the even numbered members.

Of course, it is more common to select members that meet a certain condition:

A.select(x) returns a sequence of members in A that make x true, where x can use the symbols allowed in loop functions such as ~ and #. The select() function is called a selection function, and is also a loop function.

The select() function is very common. Let’s give some examples:

	A	B
1	[13,30,45,23,42,98,61]
2	=A1.select(~%2==0)	[30,42,98]
3	=A1.select(~>30)	[45,42,98,61]
4	=A1.select(~>~[-1])	[13,30,45,42,98]
5	=A3.select((~-~[-1])%2==0)	[98]

The constant sequence of column B is the result of the select() function in column A, and A5 is further selected on the basis of A3.

In fact, A.step()can also be represented by select() function, that is

A.step(k,i)==A.select( #-i>0 && (#-i)%k==0 )

The sequence number is used as a condition to achieve the result of selection.

Using these functions, we calculate the prime numbers below 1000:

	A	B	C
1	=to(1000)	>A1(1)=0
2	for 2,31	if A1(A2)==0	next
3		>A1.step(A2,2*A2).select(~>0).run(A1(~)=0)
4	=A1.select(~>0)	=A4.select(~[1]-~==2).([~,~+2])

A composite number below 1000 always has a prime factor of no more than √1000 (<32). We only need to use the integers below 32 to filter the sequence to(1000), and exclude all the multiples of these numbers (fill in the corresponding members as 0). The remaining members that are not 0 must be prime numbers.

A1 generates the target sequence, B1 excludes 1(1 is not prime), that is, fill the member of the position with 0. Then A2 loops from 2 to 31, if the member at this position is already 0, it means that it is a composite number, leave it alone, skip to the next cycle. If it’s a prime number, pick out the positions of all its multiples, that is, A1.step(A2,A22). Pay attention to start from A22 and skip A2 itself, because A2 is a prime number and needs to be kept. Then, the members in these positions may have 0, which indicates that they have been screened out in the previous round. To exclude them, use select(~>0) to keep those that are not 0. The members in these positions are multiples of A2, that is, composite numbers. Set the members in these positions to 0, and then enter the next round.

At the end of the loop, select the remaining non-zero members, which are all prime numbers. By the way, B4 calculates all twin prime pairs (prime numbers with difference of 2) on this basis, using the select()function and A.() that we’ve learned, and the result is a two-layer sequence.

The selection function will get members out of the sequence. In the other way round, we will also care about whether a data is in the sequence and its position in the sequence.

The function A.pos(x) will return the position of x in A. More strictly, it is to find an i such that A(i)==x. If multiple A(i) are the same as x, it will return the first one. If not, it will return null. The latter case is more commonly used, that is, whether the return value of pos function is null is used to determine whether a data is in the sequence.

Similarly, there is an A.pseg(x) function. For a sequence A with increasing member value, that is, A(i)<A(i+1), the function will return the segment where x belongs, that is, if A(i)<=x<A(i+1), pseg()will return i; If x<A(1), it will return 0; If A.m(-1)<x, A.len() is returned. We take the increasing sequence as an example to explain the principle, actually it also works for decreasing sequence, but these less than signs should be changed to greater than signs.

In our daily work, we often need segmented statistics, such as age, sales amount, etc. We can use the pseg() function in these cases, but we will give these examples after we have learned structured data.

In theory, there may be duplicated values in the previous example of digital black hole, that is, a becomes b, b becomes c, and c becomes a again. In this case, it will lead to endless loop, but our current program can’t discover this situation. Only because we have done our homework in advance, we know for sure that there is only one single black hole in the four digits, so the code is written like that, but it may not be the case for other digits.

The pos()function can be used to find this kind of loop, and the pseg() function and insert() function can also be used to improve the sorting.

	A	B	C	D
1	5	12345	=A1.run(~=if(#>1,~[-1]*10,1))
2	for	=[]	=C1.(B1\~%10)
3		for C2	=B2.pseg(B3)	>B2.insert(if(C3==B2.len(),0,C3+1),B3)
4		=C1.sum(B2(#)*~)		=C1.sum(B2.m(-#)*~)
5		=@\|B1	>output(B1)	>B1=B4-D4
6		if B5.pos(B1)		Break

This program can find the digital black hole of any number of digits. A1 is the number of digits and B1 is the starting value.

Here we use the various techniques we have learned before, C1 uses run()to calculate a sequence [1,10,100,…], C2 uses it to get the number of each digit. Sorting needs only one line now, and the current result is saved in B2. B3 loops for each digit. In C3, pseg()is used to find out where the new digit should be inserted. In D3, insert() function is used to insert it. At the end of the loop, B2 is the sequence sorted from small to large. B4 and D4 use C1 again to calculate the large number and small number. The @ in B5 stands for itself. When the program starts to run, it will automatically clear all the cell values. The expression will continuously accumulate the calculated numbers and record them in a sequence. In B6, if it is found that the newly calculated number is already in the sequence, it means that the duplicated value appears and the program should be stopped.

It should be noted that pos()and pseg() are not loop functions, and the symbols such as ~, # can not and need not be used in the x of their parameters. If they are used, they are considered to belong to the loop function of the upper layer (these functions may be nested in a loop function).

These principles will be further explained at the end of this chapter.

We care about the members that meet the conditions, but sometimes we also care about the positions of those members. For example, we want to find out in which months the sales amount has increased from a sequence of 12-month sales amount. Using the select() function can only find the growing members, but not the positions of the growing members.

The A.pselect(x) function can help us by returning the sequence numbers of members that make x true. In terms of the symbols of a loop function, select()will return a sequence of ~ and pselect() will return the corresponding #. pselect() is called the positioning function.

Let’s try:

	A	B
1	[13,30,45,23,42,98,61]
2	=A1.pselect(~%2==0)	[2,5,6]
3	=A1.pselect(~>30)	[3,5,6,7]
4	=A1.pselect(~>~[-1])	[1,2,3,5,6]

We expect the result of column A to be the value of column B, but it is not? A2, A3 and A4 return 2, 3 and 1 respectively, which is only the first value in the sequence of column B.

What’s going on?

It turns out that the pselect() function is defined as such, it only returns the first # that satisfies the condition.

So, what should we do if we want to return all the qualified #, like select?

	A	B
1	[13,30,45,23,42,98,61]
2	=A1.pselect@a(~%2==0)	[2,5,6]
3	=A1.pselect@a(~>30)	[3,5,6,7]
4	=A1.pselect@a(~>~[-1])	[1,2,3,5,6]

Just add @a after the function name.

What is this?

The writing method of @a is invented by SPL, which is called function option. In theory, pselect and pselect@a are two unrelated different functions, but these two functions are very similar. They are both for sequences, and the parameters are the same. Although the functions are not exactly the same, they are very similar. In this case, we are more willing to regard one of them as a variant of the other, and use similar names when naming, which will be more convenient for understanding and memorizing.

But these two functions are still different and need to be distinguished. In SPL, we use the same name to name these two functions, and use the character after @ to distinguish them. It seems that a function has different modes: when there is no @a, pselect only returns the first one, and when there is @a, pselect will return all values.

There are many functions with options in SPL, and there are often more than one options. For example, pselect()also has a @z option, which means to search from the end and in reverse direction. In the above example, A1.pselect@z(~%2==0) will return 6. Moreover, these options may be used in combination, A1.pselect@az(~%2==0) means to find all the # that meet the condition from the end and in reverse direction, and the result will return [6,5,2].

There is no order in the writing of options, pselect@az and pselect@za are the same.

Correspondingly, the select()function has a @1 option, which means that the first member satisfying the condition will be returned. The reason for this design is that it is more common for select() to return all values, while it is more common for pselect() to return the first.

SPL function options can greatly simplify the code. At present, there is no such syntax in other programming languages, and the function of some languages also has the concept of option, but it generally exists as an independent parameter. We will see later that in structured data processing, parameters are very complicated, and it is not convenient to pass option information as parameters.

We use the max()function to find the maximum value of the sequence members. Similarly, we may also care about the position of the maximum value. SPL also provides A.pmax() function to return the position of the maximum value in sequence A.

The maximum value of a sequence is unique, but there may be more than one member with the maximum value. Therefore, pmax()is similar to pselect(), by default it returns the position of the first maximum value. If the option @a is added, it will return all positions, and @z will search from the back to the front.

The pmax()function is not difficult to understand, so we don’t give examples here. And, of course, there is also pmin() function.

We mention pmax()here to deepen the understanding of positioning function. pmax() is also a positioning function. Positioning is a common operation, but there is no corresponding library function in many programming languages. SPL pays special attention to this kind of operation.

Forget to mention that the pos() function also has @a and @z options.

It’s easy to ignore that there is another selection function maxp()that is related to max().

A.maxp()is defined as A(A.pmax()), which is the member that returns the maximum value.

If you think about it, isn’t it A.max()? Why create a new function?

It’s still different. In a simple way, A.pmax()has the @a option, which may return multiple values, so that A.maxp@a() may also return a sequence, but A.max() will not.

However, even if A.maxp@a() is a sequence, the members are all the same. It doesn’t seem very interesting.

The bigger difference is in the case of with parameters. A.max(x) returns the maximum x, while A.maxp(x) returns the member that maximizes x. For example:

	A	B
1	[13,30,45,23,42,98,61]
2	=A1.max(~%10)	8
3	=A1.maxp(~%10)	98

See the difference?

We know that A.max(x) is A.(x).max(), A.pmax(x) is also equal to A.(x).pmax(), but A.maxp(x) is not A.(x).maxp(), it is still defined as A(A.pmax(x)).

max()returns the maximum value itself, while maxp() returns the member that makes the maximum value appear. For just a sequence, they are the same, but with an expression, you can find their difference. In the structured data processing that will be covered later, this kind of calculation is very common. For example, we often care more about the product that leads to the highest sales amount than the highest sales amount itself.

max()is not a selection function, it returns a value. maxp() returns a member(or sequence of members), and is a selection function. Like positioning function, selection function is also a big kind of operation.

Of course, there is also minp() function.

SPL Programming - Preface
SPL Programming - 5.4 [Sequence as a whole] Iterative function*
SPL Programming - 5.6 [Sequence as a whole] Sorting related

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/cFTcUNs7

Youtube 👉 https://www.youtube.com/@esProc_SPL

spl programming(50)

eBook