Python vs. SPL 11 -- Many-to-One Association

 

In Python vs. SPL 10 -- One-to-N Association, we introduce one-to-one and one-to-N association. And this article will compare the computational abilities of Python and SPL in many-to-one association.

 

Foreign key association

When some fields of table A are associated with the primary key of table B, the associative fields of table A can be many, and the associative field of table B is distinct. Such scenario is a many-to-one association, also known as foreign key association, that is, table A is a fact table, and table B is a dimension table. The fields of table A associated with the primary key of table B are called the foreign keys of A to B, and table B is also called the foreign key table of A. For example:

 

There is a sale record table and a product information table. The calculation task is to aggregate the sale amount of each kind of product.

 

Some of the data in sale record table, and product information table are as follows:

 

sale record table (fact table):

recordid product sale_city amount
sr100001 p1003 c104 380
sr100002 p1005 c103 400
sr100003 p1003 c104 626

 

Product information table (dimension table):

productid pclass
p1001 A
p1002 A
p1003 B

 

Python

import pandas as pd

sr_file1="D:\data\SaleRecord.csv"

pt_file1="D:\data\Product.csv"

record1=pd.read_csv(sr_file1)

product1=pd.read_csv(pt_file1)

r_pt=pd.merge(record1,product1,left_on="product",right_on="productid")

pclass_sale=r_pt.groupby('pclass',as_index=False).amount.sum()

print(pclass_sale)

 

 

 

Fact table

 

Dimension table

Associate fact table with the foreign keys of dimension table

 

 

The merge function in Python associates two tables; sale record table “record1” is the fact table, and product information table “product1” is the dimension table. Many records in “record1” correspond to one record in “product1”, and the names of associative fields in two tables are different, so left_on and right_on mark the associative field of two tables respectively so that the two tables are associated as a wide table, then group and aggregate the records to get the final result.

 

SPL

A B
1 D:\data\SaleRecord.csv
2 D:\data\Product.csv
3 =file(A1).import@tc()
4 =file(A2).import@tc()
5 =A3.switch(product,A4:productid) /convert foreign keys to records of dimension table
6 =A5.groups(product.pclass;sum(amount):amount)

 

The switch function in SPL converts the foreign keys to corresponding records of the dimension table, and since they are records now, they can certainly reference to fields which can be used to perform grouping and aggregation operations during grouping.

 

One fact table & multiple dimension tables

One fact can be associated with multiple dimension tables, for example:

 

We continue to use the sale record table (fact table) and product information table (dimension table 1), and a new city information table (dimension table 2) is added. The calculation task is to count the sale amount of each kind of product in each province.

 

City information table (dimension table 2):

cityid name province
c101 Beijing Beijing
c102 Tianjin Tianjin
c103 Harbin Heilongjiang

 

Python

#continue to use sr_file1 and pt_file1

 

ct_file1="D:\data\City.csv"

ct1=pd.read_csv(ct_file1)

r_ct=pd.merge(record1,ct1,left_on="sale_city",right_on="cityid")

r_ct_pdt=pd.merge(r_ct,product1,left_on="product",right_on="productid")

ct_pdt_sale=r_ct_pdt.groupby(['province','pclass'],as_index=False).amount.sum()

print(ct_pdt_sale)

 

 

 

Associate fact table with dimension table 2

Associate fact table with dimension table 1

Group and aggregate

 

When associating multiple dimension tables, Python usually associates one table first and then the other table. After executing the merge function twice, a big wide table is generated which is used to perform the grouping and aggregate operations.

 

SPL

A B
/A3 is sale record table, and A4 is product information table
8 D:\data\City.csv
9 =file(A8).import@tc()
10 =A3.switch(product,A4:productid;sale_city,A9:cityid) /set primary key
11 =A10.groups(sale_city.province,product.pclass;sum(amount):amount) /group and aggregate

 

The switch function in SPL can create many foreign key associations simultaneously such as the ID number of product “product” and the “productid” in product information table, and ID number of city “sale_city” and “cityid” in city information table. More associations can be created if needed, and the fields of records can be used to perform grouping and aggregate operations after being associated. Different from Python, SPL can parse multiple associative relations at a time, which makes the association explicit and more efficient.

 

Reuse dimension table

One fact table may use the same dimension table multiple times, for example:

 

Based on sale record table and city information table, select the sale record whose sale city and producing city are the same one.

 

sale record table 2 (fact table):

recordid product product_city sale_city amount
sr100001 p1006 c105 c103 603
sr100002 p1005 c105 c102 1230
sr100003 p1003 c102 c102 885

City information table 2 (dimension table):

cityid name province
c101 Beijing Beijing
c102 Tianjin Tianjin
c103 Harbin Heilongjiang

 

Python

sr_file2="D:\data\SaleRecord2.csv"

ct_file2="D:\data\City2.csv"

record2=pd.read_csv(sr_file2)

ct2=pd.read_csv(ct_file2)

r_ct2=pd.merge(record2,ct2,left_on="sale_city",right_on="cityid")

r_ct_ct=pd.merge(r_ct2,ct2,left_on="product_city",right_on="cityid",suffixes=('_s',   '_p'))

r_ct_p_ct= r_ct_ct[r_ct_ct['province_s']==r_ct_ct['province_p']].recordid

print(r_ct_p_ct)

 

 

 

 

Associate fact table with dimension table for the first time

Associate fact table with dimension table for the second time

 

 

 

 

The sale records in the example include the ID numbers of sale city and producing city, both of which can be associated with the “cityid” of city information table. Python still uses the same method, executing the merge function twice, and merge function will generate the same field names in the second time, but Python can handle such a problem successfully by adding a different suffix.

 

SPL

A B

13 D:\data\SaleRecord2.csv
14 D:\data\City2.csv
15 =file(A13).import@tc()
16 =file(A14).import@tc()
17 =A15.switch(sale_city,A16:cityid;product_city,A16:cityid) /associate fact table with dimension table
18 =A17.select(sale_city.province==product_city.province).(recordid)

 

SPL uses the switch function to associate the same dimension table, but with different foreign keys (sale_city and product_city), and then uses the associated record fields to select the target result.

 

Multi-layer dimension table

Foreign key association may involve more than one layer of dimension table. In other words, there are scenarios of multiple layers of the dimension table. For example:

 

Based on the sale record table, product information table, and city information table, select the sale record whose sale city and producing city are in the same province.

 

Sale record table (fact table):

recordid product sale_city amount
sr100001 p1003 c104 380
sr100002 p1005 c103 400
sr100003 p1003 c104 626

 

City information table (dimension table 1):

cityid name province
c101 Beijing Beijing
c102 Tianjin Tianjin
c103 Harbin Heilongjiang

 

Product information table (dimension table 2):

productid product_city
p1001 c104
p1002 c103
p1003 c102

 

Python

sr_file3="D:\data\SaleRecord3.csv"

ct_file3="D:\data\City3.csv"

pt_file3="D:\data\Product3.csv"

record3=pd.read_csv(sr_file3)

product3=pd.read_csv(pt_file3)

ct3=pd.read_csv(ct_file3)

pdt_ct=pd.merge(product3,ct3,left_on="product_city",right_on="cityid")

r_pdt_ct=pd.merge(record3,pdt_ct,left_on="product",right_on="productid")

r_pdt_ct_ct=pd.merge(r_pdt_ct,ct3,left_on="sale_city",right_on="cityid",suffixes=('_s',   '_p'))

r_ct_p_ct2=r_pdt_ct_ct[r_pdt_ct_ct['province_s']==r_pdt_ct_ct['province_p']].recordid

print(r_ct_p_ct2)

 

 

 

 

 

 

Associate producing city with city

Associate sale record with product

Associate sale city with city

 

 

 

The city information is the dimension table of both product and sale record; product information is also the dimension table of sale record, which constitutes multiple layers of dimension tables together, and there is dimension table that is associated multiple times. Python uses the merge function three times for three associations.

 

SPL

A B

20 D:\data\SaleRecord3.csv
21 D:\data\City3.csv
22 D:\data\Product3.csv
23 =file(A20).import@tc()
24 =file(A21).import@tc()
25 =file(A22).import@tc()
26 =A25.switch(product_city,A24:cityid) /associate producing city with city
27 =A23.switch(sale_city,A24:cityid;product,A26:productid) /associate sale record with city and product
28 =A27.select(sale_city.province==product.product_city.province).(recordid)

 

Once an association is created, SPL can use it all the time, even when the association is created again. For example, we create associations on producing city and city in A26, and on sale record and product in A27; besides, the association between producing city and city still exists. Therefore, we can have reference of product.product_city.province in A28, which is quite convenient for multiple table association.

 

Self-association

Sometimes we may also encounter a scenario where a table is both a fact table and a dimension table, i.e., the table associates with itself. For example:

 

There is an employee information table, and the calculation task is to list names of all employees and their superiors.

 

Some of the employee information table are as follows:

empid name superior
7902 FORD 7566
7788 SCOTT 7566
7900 JAMES 7698

 

Python

emp_file="D:\data\Employee_.csv"

emp=pd.read_csv(emp_file)

emp_s=pd.merge(emp,emp,left_on="superior",right_on="empid",suffixes=('',  '_m'),how="left")

emp_s_name=emp_s[['name','name_m']]

print(emp_s_name)

 

 

Self associate

 

 

 

 

The operation of Python is still two-table association essentially.

 

SPL

A B

30 D:\data\Employee_.csv
31 =file(A30).import@tc()
32 =A31.switch(superior,A31:empid) /self associate
33 =A32.new(name,superior.name:s_name)

 

SPL also follows the same operation logic, using switch function to associate “superior” and “empid”.

 

Circle association

When associative relation is complex, circle association may occur. For example:

 

There is an employee information table and a department information table, and the calculation task is to select Beijing employees of Beijing manager.

 

Employee information tale:

empid name dept province
1 Rebecca 6 Beijing
2 Ashley 2 Tianjin
3 Rachel 7 Heilongjiang

 

Department information table:

deptid name manager
1 Administration 20
2 Finance 2
3 HR 162

 

Python

emp_file2="D:\data\Employee_2.csv"

dept_file2="D:\data\Department2.csv"

emp2=pd.read_csv(emp_file2)

dept2=pd.read_csv(dept_file2)

d_emp=pd.merge(dept2,emp2,left_on="manager",right_on="empid")

emp_d_emp=pd.merge(emp2,d_emp,left_on="dept",right_on="deptid",suffixes=('',  '_m'))

beijing_emp_m=emp_d_emp[(emp_d_emp['province']=="Beijing") & (emp_d_emp['province_m']=="Beijing")].name

print(beijing_emp_m)

 

 

 

 

Associate department table with employee table

Associate employee table with department table

 

Select

 

 

The above two associations are relatively independent from each other in Python. These two associations constitute a circle association to generate a wide table, and then the target result is selected.

 

SPL

A B

35 D:\data\Employee_2.csv
36 D:\data\Department2.csv
37 =file(A35).import@tc()
38 =file(A36).import@tc()
39 =A38.switch(manager,A37:empid) /associate department table with employee table
40 =A37.switch(dept,A38:deptid) /associate employee table with department table
41 =A40.select(province=="Beijing"&&dept.manager.province=="Beijing").(name) /select

 

SPL handles such association in three steps: first, it creates association on department and employee; second, it creates association on employee and department; third, it directly selects the target result using the created associations. The association operations in the previous examples are all done with the switch function which possesses a feature: the original field values will be replaced with the associated records once the association is done, and the original record values will not exist any longer. If we want to keep the original record values, the join function can be used to perform the association. For example, A40 in the example can be written as:

A40=A37.join(dept,A38:deptid,~:dpt). At this time, “dept” is the associated records which can be referenced to perform the subsequent operations. And the switch function in the previous examples can all be used in this way.

 

Mixed association

During data analysis, we may encounter mixed associations where homo-dimension, primary-sub, and foreign key associations occur at the same time, and it is when the associative relations are very complex and need to be clearly sorted out. For example:

 

Based on the order table, order detail table, product information table, employee information table, travel information table, client information table, and city information table, the task is to calculate the sale amount of Heilongjiang products sold in each province by post-90s salesman who travel for more than 10 days.

 

The associative relations are shown below:

..

 

Python

emp4 = pd.read_csv("D:\data\Employee4.csv")

trv4 = pd.read_csv("D:\data\Travel4.csv")

emp_inf = pd.merge(emp4,trv4,on=["empid","name"])

years = pd.to_datetime(emp_inf.birthday).dt.year

emp_inf_c = emp_inf[(years>=1990) & (years<2000)&(emp_inf.time>=10)]

clt4 = pd.read_csv("D:\data\Client4.csv")

city4 = pd.read_csv("D:\data\City4.csv")

sale_location = pd.merge(clt4,city4,left_on='city',right_on='cityid')

pdt4 =   pd.read_csv("D:\data\Product4.csv")

pdt_location = pd.merge(pdt4,city4,left_on='city',right_on='cityid')

detail4 = pd.read_csv("D:\data\Detail4.csv")

order4 = pd.read_csv("D:\data\Order4.csv")

detail_pdt = pd.merge(detail4,pdt_location,on='productid',how="left")

order_sale_location = pd.merge(order4,sale_location,on='clientid',how="left")

order_sale_location_emp = pd.merge(order_sale_location,emp_inf_c,left_on='saleid',right_on='empid',how="left",suffixes=('_c', '_e'))

order_inf = order_sale_location_emp[order_sale_location_emp.empid.notnull()]

order_detail = pd.merge(order_inf,detail_pdt,on='orderid',how="left",suffixes=('_s', '_p'))

order_detail_Hljp = order_detail[order_detail.province_p=="Heilongjiang"]

res = order_detail_Hljp.groupby(['empid','name_e','province_s'],as_index=False).price.sum()

print(res)

 

 

Employee table and travel table

 

Select the post-90s employees

 

 

Client table and city table

 

Product table and city table

 

 

Order detail and producing city

Order and sale city

 

Order and employee

 

 

Order and order detail

Select

 

Group and aggregate

 

There are many tables in this example with complex associative relations which are homo-dimension association (one-to-one), primary-sub association (one-to-many), and foreign key association (many-to-one), respectively. If there exists an association of many-to-many, it is most likely wrong, and the association needs to be rechecked, otherwise, many-to-many association probably leads to memory explosion. As for such complex associations, the best method provided in Python is to use the merge function to associate every two tables and parse each association step by step, which may be a bit troublesome but less prone to errors.

 

SPL

A B

43 =file("D:/data/Employee4.csv").import@tc()
44 =file("D:/data/Travel4.csv").import@tc()
45 =A44.join(empid,A43:empid,birthday)
46 =A45.select((y=year(birthday),y>=1990&&y<2000&&time>=10))
47 =file("D:/data/Client4.csv").import@tc()
48 =file("D:/data/City4.csv").import@tc()
49 =A47.join(city,A48:cityid,province)
50 =file("D:/data/Product4.csv").import@tc()
51 =A50.join(city,A48:cityid,province)
52 =file("D:/data/Detail4.csv").import@tc()
53 =file("D:/data/Order4.csv").import@tc()
54 =A52.join(productid, A51:productid,province:product_province)
55 =A54.group(orderid)
56 =A53.switch(orderid, A55:orderid;saleid, A46:empid;clientid, A49:clientid)
57 =A56.select(saleid).new(saleid.empid:empid,saleid.name:sale_name,clientid.province:sale_location,orderid.select(product_province=="Heilongjiang").sum(price):price)
58 =A57.groups(empid,sale_name,sale_location;sum(price):price).select(price)

 

SPL is quite capable to handle such complex associations, in which the homo-dimension, primary-sub, and foreign key associations are all very clear. SPL can also associate two associations simultaneously, which is very fast and less error-prone.

 

Summary

When performing foreign key association, Python still copies data, and only parses one association at a time. In addition, every association is independent, so the association created previously can not be reused later, instead, it has to be re-associated, which results in low efficiency in association.

 

On the contrary, SPL can reuse the associations that created previously, making the operation much more effective.