"In data analysis, we usually encounter the scenarios of associating two or more tables, and .."

Hoo RaqForum 19 No.
1 Reply • 407 View • 2 Years ago

Python vs. SPL 10 -- One-to-N Association

In data analysis, we usually encounter the scenarios of associating two or more tables, and the association of tables can be divided into the following categories: one-to-one, one-to-many, many-to-one, and many-to-many. The association of one-to-one means that one record of a table corresponds to only one record of another table; the association of one-to-many indicates that one record of a table corresponds to multiple records of another table. This article will compare the computational abilities of Python and SPL in one-to-N association.

Homo-dimension association

One-to-one association is also known as homo dimension association, and the associated two tables are homo dimension tables to each other. For example:

Based on the final score table and the usual score table, calculate students’ scores of the semester (the percentages of final scores and usual scores are 0.7 and 0.3 respectively).

Final score table

studentid	fscore	…
s1087	68.36422	…
s1049	78.17515	…
s1018	82.61762	…
…	…	…

Usual score table

studentid	uscore	…
s1050	75.11157	…
s1039	88.02989	…
s1091	87.95997	…
…	…	…

Python

import pandas as pd

fexam_file="D:/data/FExamRes.csv"

uperf_file="D:/data/UPerformance.csv"

fexam=pd.read_csv(fexam_file)

uperf=pd.read_csv(uperf_file)

scores=pd.merge(fexam,uperf,on="studentid",how="outer").fillna(0)

scores["total"]=0.7*scores["fscore"]+0.3*scores["uscore"]

print(scores)

The scores of final exam

The usual scores

Add 0 after associating

Calculate the scores

The merge function in Python can associate two Dataframes. Some students may lack a certain score, so we use outer join, then the score is recorded as 0 when there is no score; in this way, the final score can be obtained after weighed sum. After associating, a new Dataframe will be generated in Python, in which data are copied. In short, this whole processing takes up both a lot of time and plenty of memory.

SPL

	A	B
1	D:/data/FExamRes.csv
2	D:/data/UPerformance.csv
3	=file(A1).import@tc()
4	=file(A2).import@tc()
5	=join@f(A3:fexam,studentid;A4:uperf,studentid)	Homo dimension association
6	=A5.new([fexam.studentid,uperf.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,0.7fexam.fscore+0.3uperf.uscore:total)

The join function in SPL performs the association, @f option is outer join, and in A6 the final scores are calculated. The association in SPL only uses the primary key (studentid in this example) to create association on two tables and does not copy the data, so it occupies a little memory and performs very fast.

Multiple homo-dimension tables

During associative calculation, we may often encounter scenarios where three or more homo-dimension tables need to be associated. For example:

Apart from the final scores and usual scores, students also have the scores of elective courses. And the percentages of the three are 0.6, 0.3, and 0.1 respectively.

The final score table and usual score table are the same as the previous example, and the elective score table is shown below:

Elective score table

studentid	escore	…
s1051	74.66936	…
s1035	57.33134	…
s1047	61.29318	…
…	…	…

Python

#continue to use fexam and uperf

elec_file="D:/data/ElectiveScore.csv"

fexam=fexam.set_index(keys="studentid")

uperf=uperf.set_index(keys="studentid")

elec = pd.read_csv(elec_file,index_col='studentid')

student_score = pd.concat([fexam,uperf,elec],axis=1,join='outer',sort=True).fillna(0).reset_index()

student_score['total'] = 0.6*student_score.fscore+0.3*student_score.uscore+0.1*student_score.escore

print(student_score)

Set the index

Set the index while reading data

Associate multiple homo-dimension tables

Aggregate the scores

The merge function in Python can’t be used any longer for associating multiple homo-dimension tables because it only supports association of two tables. It will be a bit troublesome to execute the merge function twice when associating three tables. However, Python offers the concat function which can perform association based on the index, making association of multiple homo-dimension tables easier. However, in this way, we are required to master the usage of another function.

SPL

	A	B
…	/A3 is final score table, and A4 is usual score table
8	D:/data/ElectiveScore.csv
9	=file(A8).import@tc()
10	=A3.keys(studentid)	/set the primary key
11	=A4.keys(studentid)
12	=A9.keys(studentid)
13	=join@f(A3:fexam;A4:uperf;A9:elec)	/associate multiple homo-dimension tables
14	=A13.new([fexam.studentid,uperf.studentid,elec.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,elec.escore:escore,0.6fexam.fscore+0.3uperf.uscore+0.1*escore:total)	/calculate the total scores

The join function in SPL can associate two tables as well as three or more tables. After the primary key is set, the function will perform association according to the primary key automatically, and the subsequent operation is the same as that of two-table association. In SPL, it is very easy for us to adopt one method to other operations, which sparing the trouble of memorizing the usage of other functions.

Primary-sub tables

One-to-many association indicates that one record of a table is able to correspond to any record of another table. In such association, we call the table at the “one” end as primary table, and the table at the “many” end as sub table. For example, in the order table and order detail table, one order corresponds to one distinct orderid, but every orderid may correspond to many pieces of order detail information. And the order table is primary table and order detail table is sub table.

Based on the above two tables, calculate the consumption of customers of each province in s150 stores.

Order table (the primary table of order detail table)

storeid	orderid	client	location	…
s101	o10001	c104231	Beijing	…
s101	o10002	c107032	Beijing	…
s101	o10003	c108875	Beijing	…
…	…	…	…	…

Order detail table (the sub table of order table)

storeid	orderid	productid	price	…
s101	o10001	p1078	1273	…
s101	o10002	p1091	131	…
s101	o10002	p1008	780	…
…	…	…	…	…

Python

detail_file="D:/data/Detail.csv"

order_file="D:/data/Order.csv"

detail=pd.read_csv(detail_file)

order=pd.read_csv(order_file)

detail150=detail.query("storeid=='s150'")

order150=order.query("storeid=='s150'")

order_detail=pd.merge(order150,detail150,on="orderid",how="inner")

loc_consume=order_detail.groupby("location").price.sum()

print(loc_consume)

The order details of s150 stores

The orders of s150 stores

Associate two tables

Group and aggregate

The association in this example is performed on two tables, and the associative relation is also very clear. We can just use orderid to associate two tables and then group and aggregate the records according to location. Here the operation is one-to-many association, which can be converted to one-to-one association as well:

1. Aggregate the consumption amount of each order in the order detail table;

2. Associate order table with the aggregation result of Step 1, then the association is about one-to-one;

3. Group and aggregate according to location. The code is written as:

1. order_sum=detail150.groupby(“orderid”).price.sum()

2. order_detail=pd.merge(order150, order_sum,on="orderid",how="inner")

3. loc_consume=order_detail.groupby("location").price.sum()

SPL

	A	B
…	…
16	D:/data/Detail.csv
17	D:/data/Order.csv
18	=file(A16).import@tc()
19	=file(A17).import@tc()
20	=A18.select(storeid=="s150")	/select
21	=A19.select(storeid=="s150")
22	=A20.groups(orderid;sum(price):sprice)	/group and aggregate
23	=join(A21:order,orderid;A22:consume,orderid)	/associate
24	=A23.groups(order.location;sum(consume.sprice):consume)	/group and aggregate

As for primary-sub tables association, the elaborately designed SPL advocates the method of grouping and aggregation -- homo-dimension association. Because in this way the associative relation is even clearer and the operation is less prone to errors when associated tables are too many and too complex. Also, SPL can associate first and aggregate later in the same way that Python does, and the code is like:

A25=join(A21:order,orderid;A20:detail,orderid)

A26=A25.groups(order.location;sum(detail.price):consume)

Multi-layer primary-sub tables

There are also primary-sub tables of multiple layers such as:

Apart form the same order table and order detail table of last example, a new store information table is added.

Store information table (primary table)

storeid	province	…
s101	Heilongjiang	…
s102	Beijing	…
s103	Tianjin	…
…	…	…

Based on the above three tables, aggregate the consumption of stores in each province.

Python

#continue to use detail and order tables

store_file="D:/data/Store.csv"

store = pd.read_csv(store_file)

detail_sum = detail.groupby(['storeid','orderid']).price.sum()

order_detai = pd.merge(order,detail_sum,on=['storeid','orderid'])

store_amount = order_detai.groupby('storeid',as_index=False).price.sum()

store_inf = pd.merge(store,store_amount,on=["storeid"])

province_sale = store_inf.groupby('province',as_index=False).price.sum()

print(province_sale)

Aggregate the sales amount of each order

Associate orders with the aggregated sales amount

Aggregate the sales amount of each store

Associate store information with sales amount of stores

Group and aggregate according to the provinces where stores locate

The order detail table is the sub table of order table, and order table is the sub table of store table, which form two layers of primary-sub tables together. The operation follows the method of aggregation on sub tables -- association on primary table, which make the associative relation clearer, and does not lead to too big associated tables or wrong association of many-to-many. This example shows a multi-field association, which can be easily done in Python as well.

SPL

	A	B
…	…
28	D:/data/Store.csv
29	=file(A28).import@tc()
30	=A18.groups(storeid,orderid;sum(price):sprice)	/aggregate the sales amount of each order
31	=join(A19:order,[storeid,orderid];A30:detail_sum)	/associate orders with aggregated sales amount of orders
32	=A31.groups(order.storeid;sum(detail_sum.sprice):sales)	/aggregate sales amount of each store
33	=join(A29:store,storeid;A32:store_amount)	/associate store information with sales amount of stores
34	=A33.groups(store.province;sum(store_amount.sales):sales)	/group and aggregate according to the provinces where stores locate

It is also easy for SPL to complete the association with the same method introduced previously.

One primary table & multiple sub tables

It is also very common that there is one primary table and multiple sub tables, such as order table, order detail table and order return table. For example:

Part of the order table, order detail table, and payment collection table are shown below:

Order table (primary table)

orderid	clientid	date	…
10012	100658	2019/2/13	…
10023	103478	2019/1/12	…
10040	108013	2019/1/4	…
…	…	…	…

Order detail table (sub table)

orderid	productid	price	…
10012	3018	428.5	…
10012	3019	349.2	…
10023	3019	349.2	…
…	…	…	…

Payment collection table (sub table)

orderid	term	payment	…
10012	1	259.2	…
10012	2	259.2	…
10012	3	259.3	…
…	…	…	…

Based on the above three tables, calculate the orders whose payments are not fully collected.

Python

detail_file="D:/data/Detail_2.csv"

order_file="D:/data/Order_2.csv"

pay_file="D:/data/Payment_2.csv"

detail_2 = pd.read_csv(detail_file)

order_2 = pd.read_csv(order_file,index_col='orderid')

pay = pd.read_csv(pay_file)

detail_order = detail_2.groupby('orderid').price.sum()

pay_order = pay.groupby('orderid').payment.sum()

order_d_p = order_2.join([pay_order,detail_order])

nopay_order = order_d_p.query('price>payment')

print(nopay_order)

Aggregate sales amount of orders

Aggregate amount of collected payments

Associate primary table with two sub tables

Select

Order detail table is the sub table of order table, and order return table is also the sub table of order table; so there is one primary table and two sub tables. This operation still follows the method of aggregation on sub tables -- association on primary table. But Python does differently to complete the operation because the merge function cannot associate three tables simultaneously, instead, the join function is provided to associate three tables using index. It creates a new wide table, and uses the wide table to select the result, whose associative logic is slightly different from the previous one.

SPL

	A	B
…	…
36	D:/data/Detail_2.csv
37	D:/data/Order_2.csv
38	D:/data/Payment_2.csv
39	=file(A36).import@tc()
40	=file(A37).import@tc()
41	=file(A38).import@tc()
42	=A39.groups(orderid;sum(price):amount)	/aggregate sales amount of orders
43	=A41.groups(orderid;sum(payment):payment)	/aggregate amount of collected payments
44	=join(A40:order,orderid;A42:amount;A43:payment)	Associate the primary table with two sub tables
45	=A44.select(amount.amount>payment.payment).(order)	/select

SPL still follows the logic described before, using the join function to associate multiple tables step by step, then just displaying the order table after selecting.

Summary

When performing association, Python copies two tables to create a new wide table, and uses the wide table to complete the subsequent operations, which consumes more memory and degrades the efficiency. Besides, when associating multiple tables, Python needs to use other functions and some particular conditions, otherwise, it has to use the merge function many times, which is a bit cumbersome.

While SPL only establishes associative relation on two tables rather than copying data, and uses the relation to complete the operation quickly, which occupies less memory and achieves high efficiency. In addition, when associating multiple tables, SPL can still use the same method of two-table association to deal with N tables without complex detour.

When associating primary table with sub tables (one-to-many), we first aggregate the sub tables and then associate them with primary table, which forms the homo-dimension association, so they are introduced together.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Hoo • 407 View • 2 Years ago

Python vs. SPL 10 -- One-to-N Association

Homo-dimension association

Multiple homo-dimension tables

Primary-sub tables

Multi-layer primary-sub tables

One primary table & multiple sub tables

Summary

ToC