Python vs. SPL 10 -- One-to-N Association
In data analysis, we usually encounter the scenarios of associating two or more tables, and the association of tables can be divided into the following categories: one-to-one, one-to-many, many-to-one, and many-to-many. The association of one-to-one means that one record of a table corresponds to only one record of another table; the association of one-to-many indicates that one record of a table corresponds to multiple records of another table. This article will compare the computational abilities of Python and SPL in one-to-N association.
Homo-dimension association
One-to-one association is also known as homo dimension association, and the associated two tables are homo dimension tables to each other. For example:
Based on the final score table and the usual score table, calculate students’ scores of the semester (the percentages of final scores and usual scores are 0.7 and 0.3 respectively).
Final score table
studentid |
fscore |
… |
s1087 |
68.36422 |
… |
s1049 |
78.17515 |
… |
s1018 |
82.61762 |
… |
… |
… |
… |
Usual score table
studentid |
uscore |
… |
s1050 |
75.11157 |
… |
s1039 |
88.02989 |
… |
s1091 |
87.95997 |
… |
… |
… |
… |
Python
import pandas as pd fexam_file="D:/data/FExamRes.csv" uperf_file="D:/data/UPerformance.csv" fexam=pd.read_csv(fexam_file) uperf=pd.read_csv(uperf_file) scores=pd.merge(fexam,uperf,on="studentid",how="outer").fillna(0) scores["total"]=0.7*scores["fscore"]+0.3*scores["uscore"] print(scores) |
The scores of final exam The usual scores Add 0 after associating Calculate the scores
|
The merge function in Python can associate two Dataframes. Some students may lack a certain score, so we use outer join, then the score is recorded as 0 when there is no score; in this way, the final score can be obtained after weighed sum. After associating, a new Dataframe will be generated in Python, in which data are copied. In short, this whole processing takes up both a lot of time and plenty of memory.
SPL
A |
B |
|
1 |
D:/data/FExamRes.csv |
|
2 |
D:/data/UPerformance.csv |
|
3 |
=file(A1).import@tc() |
|
4 |
=file(A2).import@tc() |
|
5 |
=join@f(A3:fexam,studentid;A4:uperf,studentid) |
Homo dimension association |
6 |
=A5.new([fexam.studentid,uperf.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,0.7*fexam.fscore+0.3*uperf.uscore:total) |
The join function in SPL performs the association, @f option is outer join, and in A6 the final scores are calculated. The association in SPL only uses the primary key (studentid in this example) to create association on two tables and does not copy the data, so it occupies a little memory and performs very fast.
Multiple homo-dimension tables
During associative calculation, we may often encounter scenarios where three or more homo-dimension tables need to be associated. For example:
Apart from the final scores and usual scores, students also have the scores of elective courses. And the percentages of the three are 0.6, 0.3, and 0.1 respectively.
The final score table and usual score table are the same as the previous example, and the elective score table is shown below:
Elective score table
studentid |
escore |
… |
s1051 |
74.66936 |
… |
s1035 |
57.33134 |
… |
s1047 |
61.29318 |
… |
… |
… |
… |
Python
#continue to use fexam and uperf elec_file="D:/data/ElectiveScore.csv" fexam=fexam.set_index(keys="studentid") uperf=uperf.set_index(keys="studentid") elec = pd.read_csv(elec_file,index_col='studentid') student_score = pd.concat([fexam,uperf,elec],axis=1,join='outer',sort=True).fillna(0).reset_index() student_score['total'] = 0.6*student_score.fscore+0.3*student_score.uscore+0.1*student_score.escore print(student_score) |
Set the index
Set the index while reading data Associate multiple homo-dimension tables
Aggregate the scores
|
The merge function in Python can’t be used any longer for associating multiple homo-dimension tables because it only supports association of two tables. It will be a bit troublesome to execute the merge function twice when associating three tables. However, Python offers the concat function which can perform association based on the index, making association of multiple homo-dimension tables easier. However, in this way, we are required to master the usage of another function.
SPL
A |
B |
|
… |
/A3 is final score table, and A4 is usual score table |
|
8 |
D:/data/ElectiveScore.csv |
|
9 |
=file(A8).import@tc() |
|
10 |
=A3.keys(studentid) |
/set the primary key |
11 |
=A4.keys(studentid) |
|
12 |
=A9.keys(studentid) |
|
13 |
=join@f(A3:fexam;A4:uperf;A9:elec) |
/associate multiple homo-dimension tables |
14 |
=A13.new([fexam.studentid,uperf.studentid,elec.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,elec.escore:escore,0.6*fexam.fscore+0.3*uperf.uscore+0.1*escore:total) |
/calculate the total scores |
The join function in SPL can associate two tables as well as three or more tables. After the primary key is set, the function will perform association according to the primary key automatically, and the subsequent operation is the same as that of two-table association. In SPL, it is very easy for us to adopt one method to other operations, which sparing the trouble of memorizing the usage of other functions.
Primary-sub tables
One-to-many association indicates that one record of a table is able to correspond to any record of another table. In such association, we call the table at the “one” end as primary table, and the table at the “many” end as sub table. For example, in the order table and order detail table, one order corresponds to one distinct orderid, but every orderid may correspond to many pieces of order detail information. And the order table is primary table and order detail table is sub table.
Based on the above two tables, calculate the consumption of customers of each province in s150 stores.
Order table (the primary table of order detail table)
storeid |
orderid |
client |
location |
… |
s101 |
o10001 |
c104231 |
Beijing |
… |
s101 |
o10002 |
c107032 |
Beijing |
… |
s101 |
o10003 |
c108875 |
Beijing |
… |
… |
… |
… |
… |
… |
Order detail table (the sub table of order table)
storeid |
orderid |
productid |
price |
… |
s101 |
o10001 |
p1078 |
1273 |
… |
s101 |
o10002 |
p1091 |
131 |
… |
s101 |
o10002 |
p1008 |
780 |
… |
… |
… |
… |
… |
… |
Python
detail_file="D:/data/Detail.csv" order_file="D:/data/Order.csv" detail=pd.read_csv(detail_file) order=pd.read_csv(order_file) detail150=detail.query("storeid=='s150'") order150=order.query("storeid=='s150'") order_detail=pd.merge(order150,detail150,on="orderid",how="inner") loc_consume=order_detail.groupby("location").price.sum() print(loc_consume) |
The order details of s150 stores The orders of s150 stores Associate two tables Group and aggregate
|
The association in this example is performed on two tables, and the associative relation is also very clear. We can just use orderid to associate two tables and then group and aggregate the records according to location. Here the operation is one-to-many association, which can be converted to one-to-one association as well:
1. Aggregate the consumption amount of each order in the order detail table;
2. Associate order table with the aggregation result of Step 1, then the association is about one-to-one;
3. Group and aggregate according to location. The code is written as:
1. order_sum=detail150.groupby(“orderid”).price.sum()
2. order_detail=pd.merge(order150, order_sum,on="orderid",how="inner")
3. loc_consume=order_detail.groupby("location").price.sum()
SPL
A |
B |
|
… |
… |
|
16 |
D:/data/Detail.csv |
|
17 |
D:/data/Order.csv |
|
18 |
=file(A16).import@tc() |
|
19 |
=file(A17).import@tc() |
|
20 |
=A18.select(storeid=="s150") |
/select |
21 |
=A19.select(storeid=="s150") |
|
22 |
=A20.groups(orderid;sum(price):sprice) |
/group and aggregate |
23 |
=join(A21:order,orderid;A22:consume,orderid) |
/associate |
24 |
=A23.groups(order.location;sum(consume.sprice):consume) |
/group and aggregate |
As for primary-sub tables association, the elaborately designed SPL advocates the method of grouping and aggregation -- homo-dimension association. Because in this way the associative relation is even clearer and the operation is less prone to errors when associated tables are too many and too complex. Also, SPL can associate first and aggregate later in the same way that Python does, and the code is like:
A25=join(A21:order,orderid;A20:detail,orderid)
A26=A25.groups(order.location;sum(detail.price):consume)
Multi-layer primary-sub tables
There are also primary-sub tables of multiple layers such as:
Apart form the same order table and order detail table of last example, a new store information table is added.
Store information table (primary table)
storeid |
province |
… |
s101 |
Heilongjiang |
… |
s102 |
Beijing |
… |
s103 |
Tianjin |
… |
… |
… |
… |
Based on the above three tables, aggregate the consumption of stores in each province.
Python
#continue to use detail and order tables store_file="D:/data/Store.csv" store = pd.read_csv(store_file) detail_sum = detail.groupby(['storeid','orderid']).price.sum() order_detai = pd.merge(order,detail_sum,on=['storeid','orderid']) store_amount = order_detai.groupby('storeid',as_index=False).price.sum() store_inf = pd.merge(store,store_amount,on=["storeid"]) province_sale = store_inf.groupby('province',as_index=False).price.sum() print(province_sale) |
Aggregate the sales amount of each order
Associate orders with the aggregated sales amount Aggregate the sales amount of each store
Associate store information with sales amount of stores Group and aggregate according to the provinces where stores locate
|
The order detail table is the sub table of order table, and order table is the sub table of store table, which form two layers of primary-sub tables together. The operation follows the method of aggregation on sub tables -- association on primary table, which make the associative relation clearer, and does not lead to too big associated tables or wrong association of many-to-many. This example shows a multi-field association, which can be easily done in Python as well.
SPL
A |
B |
|
… |
… |
|
28 |
D:/data/Store.csv |
|
29 |
=file(A28).import@tc() |
|
30 |
=A18.groups(storeid,orderid;sum(price):sprice) |
/aggregate the sales amount of each order |
31 |
=join(A19:order,[storeid,orderid];A30:detail_sum) |
/associate orders with aggregated sales amount of orders |
32 |
=A31.groups(order.storeid;sum(detail_sum.sprice):sales) |
/aggregate sales amount of each store |
33 |
=join(A29:store,storeid;A32:store_amount) |
/associate store information with sales amount of stores |
34 |
=A33.groups(store.province;sum(store_amount.sales):sales) |
/group and aggregate according to the provinces where stores locate |
It is also easy for SPL to complete the association with the same method introduced previously.
One primary table & multiple sub tables
It is also very common that there is one primary table and multiple sub tables, such as order table, order detail table and order return table. For example:
Part of the order table, order detail table, and payment collection table are shown below:
Order table (primary table)
orderid |
clientid |
date |
… |
10012 |
100658 |
2019/2/13 |
… |
10023 |
103478 |
2019/1/12 |
… |
10040 |
108013 |
2019/1/4 |
… |
… |
… |
… |
… |
Order detail table (sub table)
orderid |
productid |
price |
… |
10012 |
3018 |
428.5 |
… |
10012 |
3019 |
349.2 |
… |
10023 |
3019 |
349.2 |
… |
… |
… |
… |
… |
Payment collection table (sub table)
orderid |
term |
payment |
… |
10012 |
1 |
259.2 |
… |
10012 |
2 |
259.2 |
… |
10012 |
3 |
259.3 |
… |
… |
… |
… |
… |
Based on the above three tables, calculate the orders whose payments are not fully collected.
Python
detail_file="D:/data/Detail_2.csv" order_file="D:/data/Order_2.csv" pay_file="D:/data/Payment_2.csv" detail_2 = pd.read_csv(detail_file) order_2 = pd.read_csv(order_file,index_col='orderid') pay = pd.read_csv(pay_file) detail_order = detail_2.groupby('orderid').price.sum() pay_order = pay.groupby('orderid').payment.sum() order_d_p = order_2.join([pay_order,detail_order]) nopay_order = order_d_p.query('price>payment') print(nopay_order) |
Aggregate sales amount of orders Aggregate amount of collected payments Associate primary table with two sub tables Select |
Order detail table is the sub table of order table, and order return table is also the sub table of order table; so there is one primary table and two sub tables. This operation still follows the method of aggregation on sub tables -- association on primary table. But Python does differently to complete the operation because the merge function cannot associate three tables simultaneously, instead, the join function is provided to associate three tables using index. It creates a new wide table, and uses the wide table to select the result, whose associative logic is slightly different from the previous one.
SPL
A |
B |
|
… |
… |
|
36 |
D:/data/Detail_2.csv |
|
37 |
D:/data/Order_2.csv |
|
38 |
D:/data/Payment_2.csv |
|
39 |
=file(A36).import@tc() |
|
40 |
=file(A37).import@tc() |
|
41 |
=file(A38).import@tc() |
|
42 |
=A39.groups(orderid;sum(price):amount) |
/aggregate sales amount of orders |
43 |
=A41.groups(orderid;sum(payment):payment) |
/aggregate amount of collected payments |
44 |
=join(A40:order,orderid;A42:amount;A43:payment) |
Associate the primary table with two sub tables |
45 |
=A44.select(amount.amount>payment.payment).(order) |
/select |
SPL still follows the logic described before, using the join function to associate multiple tables step by step, then just displaying the order table after selecting.
Summary
When performing association, Python copies two tables to create a new wide table, and uses the wide table to complete the subsequent operations, which consumes more memory and degrades the efficiency. Besides, when associating multiple tables, Python needs to use other functions and some particular conditions, otherwise, it has to use the merge function many times, which is a bit cumbersome.
While SPL only establishes associative relation on two tables rather than copying data, and uses the relation to complete the operation quickly, which occupies less memory and achieves high efficiency. In addition, when associating multiple tables, SPL can still use the same method of two-table association to deal with N tables without complex detour.
When associating primary table with sub tables (one-to-many), we first aggregate the sub tables and then associate them with primary table, which forms the homo-dimension association, so they are introduced together.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version