"Binary operations [pic] Set operations (File comparison) Here are f1.txt and f2.txt (Sam .."

pjf RaqForum 22 No.
1 Reply • 846 View • 4 Years ago

Structured Text File Processing in SPL (II)

Binary operations

Set operations (File comparison)

Here are f1.txt and f2.txt (Sample data is listed below). Their first lines are column names. Task: calculate intersection between their Name fields.

f1.txt:

Name Dept

Rachel Sales

Ashley R&D

Matthew Sales

Alexis Sales

Megan Marketing

f2.txt:

Name Dept

Emily HR

Ashley R&D

Matthew Sales

Alexis Sales

Megan Marketing

SPL script:

	A	B
1	=file("E:\\f1.txt").import@t()	=file("E:\\f2.txt").import@t()
2	=[A1.(Name),B1.(Name)].isect()

isect() function calculates intersection of two sets. A1.(Name) retrieves A1’s Name column and return it as a sequence; B1.(Name) retrieves B1’s Name column and return it as a sequence. Below is the final result:

undefined

Similarly, we use union() function to calculate union, diff() function to calculate difference, and conj() function to calculate concatenation (which is equivalent to union all). You can also use operators to generate more concise statement, as shown below:

A1.(Name) ^ B1.(Name)

A1.(Name) & B1.(Name)

A1.(Name) \ B1.(Name)

A1.(Name) | B1.(Name)

In the above example, the whole text file is imported and automatically split into multiple fields and one field is retrieved for calculating intersection. To compare lines of data, just add @s option to import the text file as a single-column table sequence where the column name is Name(Tab)Dept, which contains the invisible character tab. The column name is illegal and can’t be referenced directly. In this case, we can use numbers to represent the column names to write the following code:

	A	B
1	=file("D:\\f1.txt").import@ts()	=file("D:\\f2.txt").import@ts()
2	=A1.(#1)^B1.(#1)

A convenient alternative is to return only the single-column detailed data as a set (sequence) by adding @i option. This way you can directly calculate the intersection with A1^B1:

	A	B
1	=file("D:\\f1.txt").import@tis()	=file("D:\\f2.txt").import@tis()
2	=A1^B1

The two algorithms have same results:

undefined

Merge

Below are two text files – table1.txt and table2.txt, sorted by logical primary key ID1 and ID2 respectively. Task: Update table1 with data in table2 according to the primary keys. Rule: Replace a record’s field values in table1 with values in the corresponding record in table2 if their primary key values are same, and insert the table2’s record to table1 if their primary key values are different.
Source files:

Table1

Table2

ID1 ID2 Col1 Col2

2 B1 2B1 row

2 B2 2B2 row

3 B1 3B1

4 B3 4B3 Row

ID1 ID2 Col1 Col2

1 B1 diff row

2 B1 diff row

2 diff diff row

3 B1 3B1

Here’s the updated table1:

ID1	ID2	Col1	Col2
1	B1	diff	row
2	B1	diff	row
2	B2	2B2	row
2	diff	diff	row
3	B1	3B1
4	B3	4B3	row

SPL script:

	A	B
1	=file("D:\\table1.txt").cursor@t()	=file("D:\\table2.txt").cursor@t()
2	=[B1,A1].mergex@u(ID1,ID2)
3	=file("D:\\result.txt").export@t(A2)

Read in the two files as cursors and update A1 with B1 by their logical primary keys. mergex() function merges multiple data sets while keeping the result set in order; @u option enables calculating their union. The merging result will be exported to a target file.

Cursors can handle files of any size. If a file is unordered, you need to sort it first by adding a sorting expression, as shown below:

[B1.sortx(ID1,ID2),A1.sortx(ID1,ID2)].mergex@u(ID1,ID2)

Order-based set operations

Both files below are ordered by Name and Dept. Task: calculate their intersection.

f1.txt:

Name Dept

Alexis Sales

Ashley R&D

Matthew Sales

Megan Marketing

Rachel Sales

f2.txt:

Name Dept

Alexi Sales

Ashley R&D

Emily HR

Matthew Sales

Megan Marketing

This is an order-based set operation. We use merge() function to calculate the intersection, which is more efficient. Below is SPL script:

	A	B
1	=file("D:\\f1.txt").import@t()	=file("D:\\f2.txt").import@t()
2	=[B1,A1].merge@i(Name,Dept)
3	=file("D:\\result.txt").export@t(A2)

merge() function merges table sequences. It works with @i option to calculate intersection, @u option to calculate union, and @d option to calculate difference.

Here’s the final result:

undefined

Sort files with sort() function if they are unordered. Note that sorting and merging small files is slower than processing them with common set-operation functions. merge() is more suitable for handling large files.

Association operations

emp.txt is a tab separated text file, where EId field points to sales.txt’s SellerId field. Task: Align emp.txt’s Dept field, Name field and Gender field to sales.txt.

emp.txt:

EId	State	Dept	Name	Gender	Salary	Birthday
2	New York	Finance	Ashley	F	11000	1980/07/19
3	New Mexico	Sales	Rachel	F	9000	1970/12/17
4	Texas	HR	Emily	F	7000	1985/03/07
5	Texas	R&D	Ashley	F	16000	1975/05/13
6	California	Sales	Matthew	M	11000	1984/07/07
7	Illinois	Sales	Alexis	F	9000	1972/08/16
8	California	Marketing	Megan	F	11000	1979/04/19

SPL script:

	A
1	=sOrder=file("D:\\sales.txt").import@t()
2	=emp=file("D:\\emp.txt").import@t(EId,Name,Dept,Gender)
3	=join@1(sOrder:s,SellerId;emp:e,EId)
4	=A3.new(s.OrderID, s.Client, s.SellerId, s.Amount, s.OrderDate, e.Name, e.Dept, e.Gender)

join() function performs a left join where two tables are renamed s and e respectively. @l enables a left join and @f enables a full join. By default the function performs an inner join. A4 retrieves desired fields from the joining result to form a new two-dimensional table, as shown below:

undefined

Complex calculations

Multilevel association

In the following diagram, there are 5 data files. Orders is the fact table and the other four are dimension tables. Task: Find orders where the customer and the supplier are in the same region, group those orders by cities, and count orders and sum the orders amount for each city.

The tables’ relationships are as follows:

undefined

SPL script:

	A
1	=file("D:/files/orders.txt").import@t()
2	=file("D:/files/customer.txt").import@t()
3	=file("D:/files/product.txt").import@t()
4	=file("D:/files/supplier.txt").import@t()
5	=file("D:/files/region.txt").import@t()
6	=A2.switch(city,A5:city)
7	=A4.switch(city,A5:city)
8	=A3.switch(sid,A4:sid)
9	=A1.switch(pid,A3:pid; cid,A2:cid)
10	=A1.select(pid.sid.city.region==cid.city.region)
11	=A10.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount)

Import the files respectively, create associations between the fact table and each of the dimension tables, get eligible orders according to the associations, and then groups them and performs aggregations. switch() function creates foreign key style associations.

Compare files of different structures

Data.txt is a tab-delimited, six-field text file, where here field contains semicolon-separated strings. List.txt is a single-column file. Task: If a record in Data.txt where the split here field value matches any line in List.txt, output this record to result.txt.

List.txt:

Gee

Whiz

Lol

Data.txt:

field1	field2	field3	here	field5	etc
A	B	2	Gee;Whiz;Hello	13	12
A	B	2	Gee;Whizz;Hi	56	32
E		4	Btm;Lol	16	2
T		3	Whizz	13	3

SPL script:

	A
1	=file("d:\\Data.txt").import@t()
2	=file("d:\\List.txt").read@n()
3	=A1.select(here.array(";")^A2!=[])
4	=file("d:\\result.txt").export@t(A3)

select() function gets desired records by splitting each here field value into a sequence of strings with array() function and calculating its intersection with A2’s sequence using operator “^” and according to the condition that the intersection result isn’t empty.

The final result:

field1	field2	field3	here	field5	etc
A	B	2	Gee;Whiz;Hello	13	12
A	B	2	Gee;Whizz;Hi	56	32
E		4	Btm;Lol	16	2

Data extraction from a multilevel directory

D:\files is a multilevel directory where each subdirectory contains many text files. Task: Retrieve a specified line (say, the 2^nd one) from each file to write to d:\result.txt.

SPL script:

	A
1	=directory@p(path)
2	=A1.(file(~).cursor@s())
3	=A2.((~.skip(1),~.fetch@x(1)))
4	=A3.union()
5	=file("d:\\result.txt").export@a(A4)
6	=directory@dp(path)
7	=A6.(call("c:\\readfile.dfx",~))

Set the initial value of parameter path as D:\files to begin data extraction; then call the script (c:\readfile.dfx) iteratively to pass different value to parameter path each time.

directory() function gets files under the root directory specified by parameter path; @p option requires full paths and @d gets directory names only.

~.skip(1) skips the first line.

~.fetch@x(1) retrieves the line at the current position (the 2^nd line) and closes the cursor.

Split a file as multiple ones

sales.txt records a large number of orders. Task: Split the file into multiple files by year and month; the file name format is year-month.txt.

SPL script:

	A
1	=file("D:\\ sales.txt").import@t()
2	=A1.group(string(OrderDate,"yyyy-MM");~)
3	=A2.run(file("d:\\temp\\"+#1+".txt").export(#2))

Group the file by year and month and loop each group to write data to a target file. Below is file 2009-01.txt:

65 YZ 8 29600.0 2009-01-06

62 JAXE 11 8134.0 2009-01-06

64 HP 13 20000.0 2009-01-02

60 PWQ 16 3430.0 2009-01-05

63 SJCH 16 5880.0 2009-01-02

61 SJCH 19 1078.0 2009-01-08

Use cursor to import the file if its size is greater than memory capacity. If a group is still can’t be imported into the memory wholly, use groupx() function to group the file. The code structure maintains unchanged.

Stock calculation

Stock.txt stores in-out stock records. One item may have multiple in-out records per day, or may have no records in days. The initial value for an item is 0. In means in-stock and out means out-stock. Task: Calculate stock for each item. Below is the source data:

date name quantity flag

2014-04-01 Item1 15 In

2014-04-01 Item1 4 In

2014-04-02 Item1 3 In

2014-04-02 Item1 10 Out

2014-04-03 Item1 3 In

2014-04-04 Item1 5 Out

2014-04-07 Item1 4 In

2014-04-10 Item1 2 Out

2014-04-01 Item2 20 In

2014-04-02 Item3 30 In

2014-04-03 Item3 14 Out

SPL script:

	A	B
1	=file("D:\\stock.txt").import@t()
2	=A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out)
3	=A2.group(name)	=periods((t=A2.id(date)).min(),t.max(),1)
4	for A3	=A4.align(B3,date)
5		>c=0
6		=B4.new(A4.name:name, B3(#):date, c:Opening, in, (b=c+in):Total, out, (c=b-out):Close)
7		>A8=A8\|B6
8

A2 calculates the sum of quantities per item per day. B3 gets complete, continuous dates according to the earliest and the latest dates. Then group records by items and loop over each group to align records to B3. B6 calculates the each item’ daily stock and stores all stocks in A8.

Below is A8’s result:

undefined

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/cFTcUNs7

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

pjf • 846 View • 4 Years ago

Structured Text File Processing in SPL (II)

Binary operations

Set operations (File comparison)

Merge

Order-based set operations

Association operations

Complex calculations

Multilevel association

Compare files of different structures

Data extraction from a multilevel directory

Split a file as multiple ones

Stock calculation

ToC