8.1 Foreign key association

 

7.5 Transpose


8.1.1 Ordinary foreign key association

Part of the data in the sales record table, city information table and product information table are given as below:

Sales record table (fact table):

recordid product product_city sale_city amount
sr100001 p1006 c105 c103 603
sr100002 p1005 c105 c102 1230
sr100003 p1003 c102 c102 885

City information table (dimension table):

cityid name state
c101 New York New York
c102 Miami Florida
c103 Los Angeles California

Product information table (dimension table):

productid pclass
p1001 A
p1002 A
p1003 B

Perform the following calculations based on the above data:

1. One fact table and one dimension table - aggregate the sales of each product.

2. One fact table and multiple dimension tables - count the sales of each product of each state.

3. Reuse dimension table

1) Find the sales record of product sold and manufactured in the same state.

2) Calculate the sales of product sold in California.

3) Calculate the total sales of each state as the place of production.

4. Multi-layer dimension table – find the sales records of product manufactured and sold in the same state.

Sometimes, the city information of a certain product is not in the sales record table but in the product information table. In this case, the association is a multi-layer dimension table association.

The modified sales record table and product information table are as follows:

Sales record table 2:

recordid product sale_city amount
sr100001 p1003 c104 380
sr100002 p1005 c103 400
sr100003 p1003 c104 626

Product information table 2:

productid product_city
p1001 c104
p1002 c103
p1003 c102

SPL

A B
1 =file(“SaleRecord.csv”).import@tc()
2 =file(“Product.csv”).import@tc()
3 =A1.switch(product,A2:productid) /1. One fact table and one dimension table
4 =A3.groups(product.pclass;sum(amount):amount) /1. One fact table and one dimension table
5 =file(“City.csv”).import@tc()
6 =A1.switch(product,A2:productid;sale_city,A5:cityid) /2. One fact table and multiple dimension tables
7 =A6.groups(sale_city.state,product.pclass;sum(amount):amount) /2. One fact table and multiple dimension tables
8 =A1.switch(sale_city,A5:cityid;product_city,A5:cityid) /3. Reuse dimension table
9 =A8.select(sale_city.state==product_city.state).(recordid) /3. Reuse dimension table
10 =A8.select(sale_city.state==“California”).sum(amount) /The sales of product sold in California
11 =A8.groups(product_city.state;sum(amount):amount) /The total sales of each state as the place of production.
12 =file(“SaleRecord2.csv”).import@tc()
13 =file(“Product2.csv”).import@tc()
14 =A13.switch(product_city,A5:cityid) /4. Multi-layer dimension table
15 =A12.switch(sale_city,A5:cityid;product,A14:productid) /4. Multi-layer dimension table
16 =A15.select(sale_city.state==product.product_city.state).(recordid) /4. Multi-layer dimension table

The switch() function is to convert the foreign key field to the corresponding dimension table record. This function only references the dimension table record and does not copy the data and, it can parse multiple association relations at the same time and reuse the relations.

Notes: The above four tasks are independent of each other. Although the codes are written together, when executing, the code for the other three tasks should be commented out to execut expected code, otherwise the field that has been switched cannot be switched again.

The results of A3:

It should be specially noted that the associative relation in SPL is to directly reference the associative record, which allows all associative relations to be established in one go when loading data. In addition, SPL saves the associated results in a pointer manner, making it convenient to use this associative relation for subsequent calculations, which is commonly referred to as pre-association. For example, A8 is the pre-associated data, which can be directly referenced in subsequent A9, A10, and A11.

SQL

1. One fact table and one dimension table - aggregate the sales of each product

SELECT p.pclass AS pclass, SUM(s.amount) AS amount
FROM SaleRecord s
JOIN product p ON s.product = p.productid
GROUP BY p.pclass;

2. One fact table and multiple dimension tables - count the sales of each product of each state

SELECT c.state AS state, p.pclass AS pclass, SUM(s.amount) AS amount
FROM SaleRecord s
JOIN City c ON s.sale_city = c.cityid
JOIN Product p ON s.product = p.productid
GROUP BY c.state, p.pclass;

3. Reuse dimension table 1 - find the sales record of product sold & manufactured in the same state

SELECT s.recordid
FROM SaleRecord s
JOIN City c1 ON s.product_city = c1.cityid
JOIN City c2 ON s.sale_city = c2.cityid
JOIN Product p ON s.product = p.productid
WHERE c1.state = c2.state;

4. Reuse dimension table 2 - calculate the sales of product sold in California

SELECT SUM(amount) AS amount
FROM SaleRecord s
JOIN City c ON s.sale_city=c.cityid
WHERE c.state = 'California';

5. Reuse dimension table 3 - calculate the total sales of each state as the place of production.

SELECT c.state AS state, SUM(s.amount) AS amount
FROM SaleRecord s
JOIN City c ON s.product_city = c.cityid
GROUP BY c.state;

6. Multi-layer dimension table - find the sales records of product manufactured & sold in the same state.

SELECT s.recordid
FROM SaleRecord2 s
JOIN Product2 p ON s.product = p.productid
JOIN city c ON s.sale_city=c.cityid
JOIN city c2 ON p.product_city=c2.cityid
WHERE c.state=c2.state
ORDER BY s.recordid;

Python

record1=pd.read_csv("../SaleRecord.csv")
product1=pd.read_csv("../Product.csv")
r_pt=pd.merge(record1,product1,left_on="product",right_on="productid") #1.One fact table and one dimension table
pclass_sale=r_pt.groupby('pclass',as_index=False).amount.sum()
ct1=pd.read_csv("../City.csv")
r_ct=pd.merge(record1,ct1,left_on="sale_city",right_on="cityid") #2.One fact table and multiple dimension tables
r_ct_pdt=pd.merge(r_ct,product1,left_on="product",right_on="productid") #2.One fact table and multiple dimension tables
ct_pdt_sale=r_ct_pdt.groupby(['state','pclass'],as_index=False).amount.sum()
r_ct2=pd.merge(record1,ct1,left_on="sale_city",right_on="cityid") #3. Reuse dimension table
r_ct_ct=pd.merge(r_ct2,ct1,left_on="product_city",right_on="cityid",suffixes=('_s', '_p')) /#3. Reuse dimension table
r_ct_p_ct= r_ct_ct[r_ct_ct['state_s']==r_ct_ct['state_p']].recordid.values.tolist()
#ThesalesofproductsoldinCalifornia
hlj_sale_amount=r_ct_ct[r_ct_ct['state_s']=="California"].amount.sum()
#Thetotalsalesofeachstateastheplaceofproduction
state_product_amount=r_ct_ct.groupby(['state_p']).amount.sum()
record2=pd.read_csv("../SaleRecord2.csv")
product2=pd.read_csv("../Product2.csv")
pdt_ct=pd.merge(product2,ct1,left_on="product_city",right_on="cityid")
r_pdt_ct=pd.merge(record2,pdt_ct,left_on="product",right_on="productid") #4. Multi-layer dimension table
r_pdt_ct_ct=pd.merge(r_pdt_ct,ct1,left_on="sale_city",right_on="cityid",suffixes=('_s', '_p')) #4. Multi-layer dimension table
r_ct_p_ct2=r_pdt_ct_ct[r_pdt_ct_ct['state_s']==r_pdt_ct_ct['state_p']].recordid.values.tolist()

Associating in Python will generate a wide table, which completely copy the data of two tables. For example, the result of r-pt is as follows:

The associative relation created in Python can also be considered as pre-association. For example, the r_ct_ct in the reuse of dimension table can be regarded as the pre-associated result, except that Python creates a large wide table into which all the associative fields are put.

8.1.2 Cycle association

There are two tables: employee information table and department information table:

Employee information table:

empid name dept state partner
1 FORD 6 New York 12
2 SCOTT 2 Florida 12
3 JAMES 7 California 11

Department information table:

deptid name manager
1 Administration 20
2 Finance 2
3 HR 162

Perform the following calculations based on the data above:

1. Self-association - list the names of all employees and their partners

2. Cycle association - find the Florida employee of Florida manager

SPL

A B
1 =file(“Employee_1.csv”).import@tc()
2 =A1.switch(partner,A1:empid) /1. Self-association
3 =A2.new(name,partner.name:name_p)
4 =file(“Department2.csv”).import@tc()
5 =A4.switch(manager,A1:empid)
6 =A1.switch(dept,A5:deptid) /2. Cycle association
7 =A6.select(state==“Florida”&&dept.manager.state==“Florida”).(name) /2. Cycle association

SQL

1. Self-association - list the names of all employees and their partners

SELECT e.name AS name, p.name AS partner
FROM Employee_1 e
LEFT JOIN Employee_1 p ON e.partner = p.empid;

2. Cycle association - find the Florida employee of Florida manager

SELECT e.name AS employee_name
FROM Employee_1 e
JOIN Department2 d ON e.dept = d.deptid
JOIN Employee_1 m ON d.manager = m.empid
WHERE e.state = 'Florida' AND m.state = 'Florida';

Python

emp=pd.read_csv("../Employee_1.csv")
emp_s=pd.merge(emp,emp,left_on="partner",right_on="empid",suffixes=('', '_p'),how="left")
#1. Self-association
emp_s_name=emp_s[['name','name_p']]
dept2=pd.read_csv("../Department2.csv")
d_emp=pd.merge(dept2,emp,left_on="manager",right_on="empid")
#2. Cycle association
emp_d_emp=pd.merge(emp,d_emp,left_on="dept",right_on="deptid",suffixes=('', '_m'))
#2. Cycle association
Florida_emp_m=emp_d_emp[(emp_d_emp['state']=="Florida") & (emp_d_emp['state_m']=="Florida")].name.tolist()

8.2 Primary key association
Example codes for comparing SPL, SQL, and Python