Hoo RaqForum 19 No.
1 Reply • 431 View • 2 Years ago

Python vs. SPL 11 -- Many-to-One Association

In Python vs. SPL 10 -- One-to-N Association, we introduced one-to-one and one-to-N associations. This article compares the computational abilities of Python and SPL in many-to-one association.

Foreign key association

When some fields of table A are associated with the primary key of table B, the same key value may appear in many records of table A, whereas it identifies exactly one record in table B. Such a scenario is a many-to-one association, also known as a foreign key association: table A is the fact table and table B is the dimension table. The fields of table A associated with the primary key of table B are called the foreign keys of A to B, and table B is also called the foreign key table of A. For example:

There is a sale record table and a product information table. The calculation task is to aggregate the sale amount of each class of product.

Part of the data in the sale record table and the product information table is as follows:

sale record table (fact table):

recordid	product	sale_city	amount	…
sr100001	p1003	c104	380	…
sr100002	p1005	c103	400	…
sr100003	p1003	c104	626	…
…	…	…	…	…

Product information table (dimension table):

productid	pclass	…
p1001	A	…
p1002	A	…
p1003	B	…
…	…	…

Python

import pandas as pd
sr_file1=r"D:\data\SaleRecord.csv"
pt_file1=r"D:\data\Product.csv"
# Fact table
record1=pd.read_csv(sr_file1)
# Dimension table
product1=pd.read_csv(pt_file1)
# Associate the foreign key of the fact table with the primary key of the dimension table
r_pt=pd.merge(record1,product1,left_on="product",right_on="productid")
# Group and aggregate
pclass_sale=r_pt.groupby('pclass',as_index=False).amount.sum()
print(pclass_sale)

The merge function in Python associates two tables: the sale record table “record1” is the fact table, and the product information table “product1” is the dimension table. Many records in “record1” correspond to one record in “product1”. Because the associative fields have different names in the two tables, left_on and right_on specify the associative field of each table. The merge produces a wide table, which is then grouped and aggregated to get the final result.
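An inner merge silently drops fact records whose foreign key has no match in the dimension table. The sketch below is a small check for that, using only the rows visible in the tables above as in-memory data (so p1005 has no match in this sample). The `how="left"` and `indicator=True` arguments are additions for illustration and are not part of the original script.

```python
import pandas as pd

# In-memory stand-ins for SaleRecord.csv and Product.csv (visible sample rows only).
record = pd.DataFrame({
    "recordid": ["sr100001", "sr100002", "sr100003"],
    "product": ["p1003", "p1005", "p1003"],
    "amount": [380, 400, 626],
})
product = pd.DataFrame({
    "productid": ["p1001", "p1002", "p1003"],
    "pclass": ["A", "A", "B"],
})

# A left merge with indicator=True adds a "_merge" column that flags
# fact rows without a matching dimension record as "left_only".
checked = pd.merge(record, product, left_on="product", right_on="productid",
                   how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "recordid"].tolist()
print("unmatched foreign keys in:", unmatched)

# The inner merge from the article, followed by grouping and aggregation.
wide = pd.merge(record, product, left_on="product", right_on="productid")
pclass_sale = wide.groupby("pclass", as_index=False)["amount"].sum()
print(pclass_sale)
```

Whether unmatched records should be dropped, reported, or kept under a placeholder class depends on the task, so this check is best run before the aggregation.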

SPL

	A	B
1	D:\data\SaleRecord.csv
2	D:\data\Product.csv
3	=file(A1).import@tc()
4	=file(A2).import@tc()
5	=A3.switch(product,A4:productid)	/convert foreign keys to records of dimension table
6	=A5.groups(product.pclass;sum(amount):amount)

The switch function in SPL converts the foreign key values into the corresponding records of the dimension table. Since they are now records, their fields can be referenced directly (for example, product.pclass) during grouping and aggregation.
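The idea of turning a foreign key into a record reference can be sketched in plain Python with dictionaries. This is only an analogy and says nothing about how SPL is implemented; the helper name `switch` and the sample rows are assumptions made for this illustration.

```python
from collections import defaultdict

def switch(fact_rows, fk_field, dim_rows, pk_field):
    """Hypothetical helper: replace each foreign key value with the matching record."""
    index = {row[pk_field]: row for row in dim_rows}   # primary key -> record
    for row in fact_rows:
        row[fk_field] = index.get(row[fk_field])       # None when there is no match
    return fact_rows

# Illustrative rows, not the real CSV contents.
products = [
    {"productid": "p1001", "pclass": "A"},
    {"productid": "p1002", "pclass": "A"},
    {"productid": "p1003", "pclass": "B"},
]
sales = [
    {"recordid": "sr100001", "product": "p1003", "amount": 380},
    {"recordid": "sr100002", "product": "p1001", "amount": 400},
    {"recordid": "sr100003", "product": "p1003", "amount": 626},
]

switch(sales, "product", products, "productid")

# Group by a field reached through the reference, similar to product.pclass.
totals = defaultdict(int)
for row in sales:
    totals[row["product"]["pclass"]] += row["amount"]
print(dict(totals))
```

No dimension data is copied here: every sale row simply points at a shared product record.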

One fact table & multiple dimension tables

One fact table can be associated with multiple dimension tables, for example:

We continue to use the sale record table (fact table) and the product information table (dimension table 1), and add a new city information table (dimension table 2). The calculation task is to sum the sale amount of each class of product in each province.

City information table (dimension table 2):

cityid	name	province	…
c101	Beijing	Beijing	…
c102	Tianjin	Tianjin	…
c103	Harbin	Heilongjiang	…
…	…	…	…

Python

# continue to use record1 and product1 loaded above
ct_file1=r"D:\data\City.csv"
ct1=pd.read_csv(ct_file1)
# Associate fact table with dimension table 2
r_ct=pd.merge(record1,ct1,left_on="sale_city",right_on="cityid")
# Associate fact table with dimension table 1
r_ct_pdt=pd.merge(r_ct,product1,left_on="product",right_on="productid")
# Group and aggregate
ct_pdt_sale=r_ct_pdt.groupby(['province','pclass'],as_index=False).amount.sum()
print(ct_pdt_sale)

When associating multiple dimension tables, Python usually associates one table first and then the other. After the merge function has been executed twice, a large wide table is generated, on which the grouping and aggregation are performed.
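When there are several dimension tables, the merges can be declared in one list and chained, which keeps the associations visible in one place. The following is a minimal sketch with made-up in-memory rows; the `associations` list and the `associate` helper are hypothetical names introduced for this example.

```python
from functools import reduce
import pandas as pd

# Illustrative in-memory tables; the rows are made up for this sketch.
record = pd.DataFrame({
    "recordid": ["sr1", "sr2", "sr3"],
    "product": ["p1003", "p1001", "p1003"],
    "sale_city": ["c103", "c102", "c103"],
    "amount": [380, 400, 626],
})
product = pd.DataFrame({"productid": ["p1001", "p1002", "p1003"],
                        "pclass": ["A", "A", "B"]})
city = pd.DataFrame({"cityid": ["c101", "c102", "c103"],
                     "province": ["Beijing", "Tianjin", "Heilongjiang"]})

# Each entry: (dimension table, foreign key in fact table, primary key in dimension table).
associations = [
    (city, "sale_city", "cityid"),
    (product, "product", "productid"),
]

def associate(left, assoc):
    """Merge one dimension table onto the running wide table."""
    dim, fk, pk = assoc
    return left.merge(dim, left_on=fk, right_on=pk)

# pandas still performs one merge per association; reduce only chains them.
wide = reduce(associate, associations, record)
result = wide.groupby(["province", "pclass"], as_index=False)["amount"].sum()
print(result)
```

This does not change the cost of the merges; it only makes the list of foreign key relations explicit.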

SPL

	A	B
…	/A3 is sale record table, and A4 is product information table
8	D:\data\City.csv
9	=file(A8).import@tc()
10	=A3.switch(product,A4:productid;sale_city,A9:cityid)	/associate foreign keys with both dimension tables
11	=A10.groups(sale_city.province,product.pclass;sum(amount):amount)	/group and aggregate

The switch function in SPL can create several foreign key associations at once, such as the product ID “product” with “productid” in the product information table, and the city ID “sale_city” with “cityid” in the city information table. More associations can be added if needed, and after association the fields of the referenced records can be used for grouping and aggregation. Unlike Python, SPL can resolve multiple associative relations in one call, which makes the associations explicit and more efficient.

Reuse dimension table

One fact table may use the same dimension table multiple times, for example:

Based on the sale record table and the city information table, select the sale records whose sale city and producing city are in the same province.

sale record table 2 (fact table):

recordid	product	product_city	sale_city	amount	…
sr100001	p1006	c105	c103	603	…
sr100002	p1005	c105	c102	1230	…
sr100003	p1003	c102	c102	885	…
…	…	…	…	…	…

City information table 2 (dimension table):

cityid	name	province	…
c101	Beijing	Beijing	…
c102	Tianjin	Tianjin	…
c103	Harbin	Heilongjiang	…
…	…	…	…

Python

sr_file2=r"D:\data\SaleRecord2.csv"
ct_file2=r"D:\data\City2.csv"
record2=pd.read_csv(sr_file2)
ct2=pd.read_csv(ct_file2)
# Associate fact table with dimension table for the first time (sale city)
r_ct2=pd.merge(record2,ct2,left_on="sale_city",right_on="cityid")
# Associate fact table with dimension table for the second time (producing city)
r_ct_ct=pd.merge(r_ct2,ct2,left_on="product_city",right_on="cityid",suffixes=('_s','_p'))
# Select the records whose sale province equals the producing province
r_ct_p_ct=r_ct_ct[r_ct_ct['province_s']==r_ct_ct['province_p']].recordid
print(r_ct_p_ct)

The sale records in this example contain the IDs of both the sale city and the producing city, and both can be associated with the “cityid” of the city information table. Python still uses the same method and executes the merge function twice. The second merge produces duplicate field names, which Python resolves by appending a different suffix to each side.
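Suffixes depend on which side of the merge a column comes from, which can be easy to misread. One alternative, sketched below with made-up rows, is to prefix the dimension table's columns before each merge so that every column name states its role. This swaps in `DataFrame.add_prefix` for the `suffixes` argument used in the article; the prefixes `sale_` and `prod_` are arbitrary choices for illustration.

```python
import pandas as pd

# Illustrative in-memory tables; the rows are made up for this sketch.
record = pd.DataFrame({
    "recordid": ["sr1", "sr2", "sr3"],
    "product_city": ["c103", "c101", "c102"],
    "sale_city": ["c103", "c102", "c102"],
})
city = pd.DataFrame({
    "cityid": ["c101", "c102", "c103"],
    "name": ["Beijing", "Tianjin", "Harbin"],
    "province": ["Beijing", "Tianjin", "Heilongjiang"],
})

# One renamed copy of the dimension table per role it plays.
sale_dim = city.add_prefix("sale_")   # sale_cityid, sale_name, sale_province
prod_dim = city.add_prefix("prod_")   # prod_cityid, prod_name, prod_province

wide = (
    record
    .merge(sale_dim, left_on="sale_city", right_on="sale_cityid")
    .merge(prod_dim, left_on="product_city", right_on="prod_cityid")
)

same = wide.loc[wide["sale_province"] == wide["prod_province"], "recordid"]
print(same.tolist())
```

The dimension table is still merged twice, so the data is copied twice; only the naming becomes clearer.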

SPL

	A	B
…	…
13	D:\data\SaleRecord2.csv
14	D:\data\City2.csv
15	=file(A13).import@tc()
16	=file(A14).import@tc()
17	=A15.switch(sale_city,A16:cityid;product_city,A16:cityid)	/associate fact table with dimension table
18	=A17.select(sale_city.province==product_city.province).(recordid)

SPL uses the switch function to associate the same dimension table, but with different foreign keys (sale_city and product_city), and then uses the associated record fields to select the target result.

Multi-layer dimension table

Foreign key association may involve more than one layer of dimension tables: a dimension table can itself reference another dimension table. For example:

Based on the sale record table, product information table, and city information table, select the sale record whose sale city and producing city are in the same province.

Sale record table (fact table):

recordid	product	sale_city	amount	…
sr100001	p1003	c104	380	…
sr100002	p1005	c103	400	…
sr100003	p1003	c104	626	…
…	…	…	…	…

City information table (dimension table 1):

cityid	name	province	…
c101	Beijing	Beijing	…
c102	Tianjin	Tianjin	…
c103	Harbin	Heilongjiang	…
…	…	…	…

Product information table (dimension table 2):

productid	product_city	…
p1001	c104	…
p1002	c103	…
p1003	c102	…
…	…	…

Python

sr_file3=r"D:\data\SaleRecord3.csv"
ct_file3=r"D:\data\City3.csv"
pt_file3=r"D:\data\Product3.csv"
record3=pd.read_csv(sr_file3)
product3=pd.read_csv(pt_file3)
ct3=pd.read_csv(ct_file3)
# Associate producing city with city
pdt_ct=pd.merge(product3,ct3,left_on="product_city",right_on="cityid")
# Associate sale record with product
r_pdt_ct=pd.merge(record3,pdt_ct,left_on="product",right_on="productid")
# Associate sale city with city; the left side carries the producing city (_p), the right side the sale city (_s)
r_pdt_ct_ct=pd.merge(r_pdt_ct,ct3,left_on="sale_city",right_on="cityid",suffixes=('_p','_s'))
# Select the records whose sale province equals the producing province
r_ct_p_ct2=r_pdt_ct_ct[r_pdt_ct_ct['province_s']==r_pdt_ct_ct['province_p']].recordid
print(r_ct_p_ct2)

The city information table is a dimension table of both the product table and the sale record table, and the product information table is in turn a dimension table of the sale record table. Together they form multiple layers of dimension tables, and the city table is associated more than once. Python calls the merge function three times for the three associations.

SPL

	A	B
…	…
20	D:\data\SaleRecord3.csv
21	D:\data\City3.csv
22	D:\data\Product3.csv
23	=file(A20).import@tc()
24	=file(A21).import@tc()
25	=file(A22).import@tc()
26	=A25.switch(product_city,A24:cityid)	/associate producing city with city
27	=A23.switch(sale_city,A24:cityid;product,A26:productid)	/associate sale record with city and product
28	=A27.select(sale_city.province==product.product_city.province).(recordid)

Once an association is created, SPL can keep using it, even after further associations are created on top of it. For example, A26 associates the producing city with the city table, and A27 associates the sale record with the city and product tables; the association between producing city and city created in A26 still exists. Therefore, A28 can reference product.product_city.province directly, which is quite convenient for multi-table association.
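The layered reference product.product_city.province can be imitated in plain Python by linking dictionaries layer by layer. This is an analogy only, with made-up rows; it is not a description of SPL internals.

```python
# Illustrative rows, keyed by primary key for direct lookup.
cities = {c["cityid"]: c for c in [
    {"cityid": "c102", "name": "Tianjin", "province": "Tianjin"},
    {"cityid": "c103", "name": "Harbin", "province": "Heilongjiang"},
]}
products = {p["productid"]: p for p in [
    {"productid": "p1002", "product_city": "c103"},
    {"productid": "p1003", "product_city": "c102"},
]}
sales = [
    {"recordid": "sr1", "product": "p1003", "sale_city": "c102"},
    {"recordid": "sr2", "product": "p1002", "sale_city": "c102"},
    {"recordid": "sr3", "product": "p1002", "sale_city": "c103"},
]

# Layer 1: product -> city (done once, then reused by every sale).
for p in products.values():
    p["product_city"] = cities[p["product_city"]]

# Layer 2: sale -> city and sale -> product.
for s in sales:
    s["sale_city"] = cities[s["sale_city"]]
    s["product"] = products[s["product"]]

# Follow the chain sale -> product -> producing city -> province.
same_province = [
    s["recordid"] for s in sales
    if s["sale_city"]["province"] == s["product"]["product_city"]["province"]
]
print(same_province)
```

Because the first layer is stored on the product records themselves, it does not have to be rebuilt when the sale records are linked.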

Self-association

Sometimes we may also encounter a scenario where a table is both a fact table and a dimension table, i.e., the table associates with itself. For example:

There is an employee information table, and the calculation task is to list names of all employees and their superiors.

Part of the employee information table is as follows:

empid	name	superior	…
7902	FORD	7566	…
7788	SCOTT	7566	…
7900	JAMES	7698	…
…	…	…	…

Python

emp_file=r"D:\data\Employee_.csv"
emp=pd.read_csv(emp_file)
# Self associate: a left join keeps employees that have no superior
emp_s=pd.merge(emp,emp,left_on="superior",right_on="empid",suffixes=('','_m'),how="left")
emp_s_name=emp_s[['name','name_m']]
print(emp_s_name)

In Python, this is still essentially a two-table association: the table is merged with itself.
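When only one column is needed from the dimension side, a lookup with `Series.map` can stand in for the self-merge. This is an alternative technique rather than the method used in the article, and it assumes that empid is unique. The JONES row and the absent id 7839 are illustrative additions to the sample.

```python
import pandas as pd

# Illustrative rows; 7839 is deliberately absent from this sample.
emp = pd.DataFrame({
    "empid": [7902, 7788, 7566],
    "name": ["FORD", "SCOTT", "JONES"],
    "superior": [7566, 7566, 7839],
})

# Lookup Series: empid -> name. The index must be unique for map to work.
name_by_id = emp.set_index("empid")["name"]

# Translate each superior id into a name; ids without a match become NaN,
# which mirrors how="left" in the self-merge.
emp["superior_name"] = emp["superior"].map(name_by_id)
print(emp[["name", "superior_name"]])
```

Unlike the merge, this adds a single column and leaves the rest of the table untouched.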

SPL

	A	B
…	…
30	D:\data\Employee_.csv
31	=file(A30).import@tc()
32	=A31.switch(superior,A31:empid)	/self associate
33	=A32.new(name,superior.name:s_name)

SPL also follows the same operation logic, using switch function to associate “superior” and “empid”.

Circle association

When the associative relations are complex, a circle association may occur. For example:

There is an employee information table and a department information table, and the calculation task is to select the employees from Beijing whose department manager is also from Beijing.

Employee information table:

empid	name	dept	province	…
1	Rebecca	6	Beijing	…
2	Ashley	2	Tianjin	…
3	Rachel	7	Heilongjiang	…
…	…	…	…	…

Department information table:

deptid	name	manager	…
1	Administration	20	…
2	Finance	2	…
3	HR	162	…
…	…	…	…

Python

emp_file2=r"D:\data\Employee_2.csv"
dept_file2=r"D:\data\Department2.csv"
emp2=pd.read_csv(emp_file2)
dept2=pd.read_csv(dept_file2)
# Associate department table with employee table (the manager's record)
d_emp=pd.merge(dept2,emp2,left_on="manager",right_on="empid")
# Associate employee table with department table; manager columns get the _m suffix
emp_d_emp=pd.merge(emp2,d_emp,left_on="dept",right_on="deptid",suffixes=('','_m'))
# Select
beijing_emp_m=emp_d_emp[(emp_d_emp['province']=="Beijing") & (emp_d_emp['province_m']=="Beijing")].name
print(beijing_emp_m)

In Python, the two merges are independent of each other. Together they form the circle association and produce a wide table, from which the target result is selected.

SPL

	A	B
…	…
35	D:\data\Employee_2.csv
36	D:\data\Department2.csv
37	=file(A35).import@tc()
38	=file(A36).import@tc()
39	=A38.switch(manager,A37:empid)	/associate department table with employee table
40	=A37.switch(dept,A38:deptid)	/associate employee table with department table
41	=A40.select(province=="Beijing"&&dept.manager.province=="Beijing").(name)	/select

SPL handles such an association in three steps: first, it associates the department table with the employee table; second, it associates the employee table with the department table; third, it selects the target result directly through the created associations. The association operations in the previous examples are all done with the switch function, which has one notable feature: once the association is done, the original field values are replaced with the associated records and no longer exist. If we want to keep the original field values, the join function can be used to perform the association instead. For example, A40 in the example can be written as:

A40=A37.join(dept,A38:deptid,~:dpt). Here “dept” keeps its original value, and the new field “dpt” holds the associated record, which can be referenced in subsequent operations (A41 would then reference dpt.manager.province). The switch function in the previous examples can all be replaced in this way.
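The difference between overwriting a key and adding a reference next to it can be sketched in plain Python. This is a dictionary-based analogy with made-up rows, and the field names `manager_rec` and `dpt` are chosen only for this illustration. It also shows that a circle association causes no trouble when records merely point at each other.

```python
# Illustrative rows, not the real CSV contents.
employees = [
    {"empid": 1, "name": "Rebecca", "dept": 1, "province": "Beijing"},
    {"empid": 2, "name": "Ashley", "dept": 1, "province": "Tianjin"},
    {"empid": 3, "name": "Rachel", "dept": 2, "province": "Beijing"},
]
departments = [
    {"deptid": 1, "name": "Administration", "manager": 3},
    {"deptid": 2, "name": "Finance", "manager": 2},
]

emp_by_id = {e["empid"]: e for e in employees}
dept_by_id = {d["deptid"]: d for d in departments}

# Department -> manager record, stored in a new field so "manager" keeps the raw id.
for d in departments:
    d["manager_rec"] = emp_by_id.get(d["manager"])

# Employee -> department record, stored in "dpt" so "dept" keeps the raw id.
for e in employees:
    e["dpt"] = dept_by_id.get(e["dept"])

# Beijing employees whose department manager is also from Beijing.
result = [
    e["name"] for e in employees
    if e["province"] == "Beijing"
    and e["dpt"]["manager_rec"]["province"] == "Beijing"
]
print(result)
```

The records now reference each other in a circle (employee, department, manager), yet no row is duplicated and the original key values remain available.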

Mixed association

During data analysis, we may encounter mixed associations in which homo-dimension, primary-sub, and foreign key associations occur at the same time. The associative relations are then very complex and need to be sorted out clearly. For example:

Based on the order table, order detail table, product information table, employee information table, travel information table, client information table, and city information table, the task is to calculate the sale amount of Heilongjiang products sold in each province by post-90s salespeople who traveled for at least 10 days.

The associative relations are shown below:

Python

# Employee table and travel table
emp4 = pd.read_csv(r"D:\data\Employee4.csv")
trv4 = pd.read_csv(r"D:\data\Travel4.csv")
emp_inf = pd.merge(emp4,trv4,on=["empid","name"])
# Select the post-90s employees who traveled for at least 10 days
years = pd.to_datetime(emp_inf.birthday).dt.year
emp_inf_c = emp_inf[(years>=1990) & (years<2000) & (emp_inf.time>=10)]
# Client table and city table
clt4 = pd.read_csv(r"D:\data\Client4.csv")
city4 = pd.read_csv(r"D:\data\City4.csv")
sale_location = pd.merge(clt4,city4,left_on='city',right_on='cityid')
# Product table and city table
pdt4 = pd.read_csv(r"D:\data\Product4.csv")
pdt_location = pd.merge(pdt4,city4,left_on='city',right_on='cityid')
# Order detail and producing city
detail4 = pd.read_csv(r"D:\data\Detail4.csv")
order4 = pd.read_csv(r"D:\data\Order4.csv")
detail_pdt = pd.merge(detail4,pdt_location,on='productid',how="left")
# Order and sale city
order_sale_location = pd.merge(order4,sale_location,on='clientid',how="left")
# Order and employee
order_sale_location_emp = pd.merge(order_sale_location,emp_inf_c,left_on='saleid',right_on='empid',how="left",suffixes=('_c', '_e'))
order_inf = order_sale_location_emp[order_sale_location_emp.empid.notnull()]
# Order and order detail
order_detail = pd.merge(order_inf,detail_pdt,on='orderid',how="left",suffixes=('_s', '_p'))
# Select
order_detail_Hljp = order_detail[order_detail.province_p=="Heilongjiang"]
# Group and aggregate
res = order_detail_Hljp.groupby(['empid','name_e','province_s'],as_index=False).price.sum()
print(res)

There are many tables in this example, and their associative relations are complex: homo-dimension association (one-to-one), primary-sub association (one-to-many), and foreign key association (many-to-one) all occur. If a many-to-many association shows up, it is most likely a mistake and should be rechecked, because a many-to-many association multiplies rows and can easily exhaust memory. For such complex associations, the most reliable approach in Python is to use the merge function to associate two tables at a time and work through the associations step by step, which is somewhat tedious but less error-prone.
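pandas can check the expected cardinality of a merge through the `validate` argument, which helps catch an accidental many-to-many association before it multiplies rows. The sketch below uses made-up rows in which the client key is duplicated on purpose; it is an illustration of the check, not part of the original script.

```python
import pandas as pd
from pandas.errors import MergeError

# Illustrative rows: clientid 1 appears twice in the "dimension" table by mistake.
orders = pd.DataFrame({"clientid": [1, 1, 2], "price": [10, 20, 30]})
clients = pd.DataFrame({"clientid": [1, 1, 2],
                        "province": ["Beijing", "Tianjin", "Tianjin"]})

# validate="many_to_one" requires the right-hand key to be unique.
try:
    pd.merge(orders, clients, on="clientid", validate="many_to_one")
    caught = False
except MergeError as exc:
    caught = True
    print("not a many-to-one association:", exc)

# Without validation, the duplicated key silently multiplies rows.
unchecked = pd.merge(orders, clients, on="clientid")
print(len(orders), "rows became", len(unchecked))
```

Adding the check to each merge in a long pipeline costs little and points directly at the table whose key is not unique.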

SPL

	A	B
…	…
43	=file("D:/data/Employee4.csv").import@tc()
44	=file("D:/data/Travel4.csv").import@tc()
45	=A44.join(empid,A43:empid,birthday)
46	=A45.select((y=year(birthday),y>=1990&&y<2000&&time>=10))
47	=file("D:/data/Client4.csv").import@tc()
48	=file("D:/data/City4.csv").import@tc()
49	=A47.join(city,A48:cityid,province)
50	=file("D:/data/Product4.csv").import@tc()
51	=A50.join(city,A48:cityid,province)
52	=file("D:/data/Detail4.csv").import@tc()
53	=file("D:/data/Order4.csv").import@tc()
54	=A52.join(productid, A51:productid,province:product_province)
55	=A54.group(orderid)
56	=A53.switch(orderid, A55:orderid;saleid, A46:empid;clientid, A49:clientid)
57	=A56.select(saleid).new(saleid.empid:empid,saleid.name:sale_name,clientid.province:sale_location,orderid.select(product_province=="Heilongjiang").sum(price):price)
58	=A57.groups(empid,sale_name,sale_location;sum(price):price).select(price)

SPL handles such complex associations well, and the homo-dimension, primary-sub, and foreign key associations each remain clearly expressed. SPL can also resolve several associations in one step (as in A56), which is fast and less error-prone.

Summary

When performing foreign key association, Python still copies data and resolves only one association at a time. In addition, every association is independent, so an association created earlier cannot be reused later; it has to be created again, which makes association inefficient.

In contrast, SPL can reuse associations that were created previously, which makes the operation much more efficient.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL
