MongoDB Data Grouping & Aggregationate
The document-based NoSQL database MongoDB is popular worldwide thanks to its powerful processing ability. Besides the basic insert, delete, update and query, it also supports the relatively complicated grouping and aggregate operation. Yet in computing field there is always a better option. Here we’ll compare MongoDB method and SPL (Structured Process Language) way of grouping and summarizing data stored in MongoDB in several scenarios.
1. Nested array query
Task : Calculate average score of all subjects and each student’s total score.
Sample data:
_id | name | sex | Scroe |
---|---|---|---|
1 | Tom | F | [{“lesson”:"Physics",“mark”:60 },{“lesson”:"Chemical",“mark”:72}] |
2 | Jerry | M | [{“lesson”:"Physics",“mark”:92 },{“lesson”:"Math",“mark”:81}] |
Expected result:
Physics | 76 |
Tom | 132 |
Chemical | 72 |
Jerry | 173 |
Math | 81 |
Mongodb script:
The score field contains arrays of [{subject, mark},…] structure. They need to be split to map on to the student to be further processed using $unwind operator and group() method.
SPL script (student.dfx):
A | |
---|---|
1 | =mongo_open(“mongodb://127.0.0.1:27017/raqdb”) |
2 | =mongo_shell(A1,“student.find()”).fetch() |
3 | =A2.conj(scroe).groups(lesson:LESSON;avg(mark):AVG) |
4 | =A2.new(name:NAME,scroe.sum(mark):TOTAL) |
5 | >A1.close() |
Average score for each subject:
LESSON | AVG |
Chemical | 72.0 |
Math | 81.0 |
Physics | 76.0 |
Total score of each student:
NAME | TOTAL |
Tom | 132 |
Jerry | 173 |
Script explanation:
A1: Connect to mongodb database.
A2: Read in data from student collection.
A3: Split score field and union it into a table sequence, and group the table by subject to calculate the average score.
A4: Calculate each student’s total score and return a new table sequence consisting of NAME and TOTAL fields. new() function creates a new table sequence.
A5: Close mongodb connection.
2. Nested document query
Task: Calculate sum of quantities in every income value and output value in each document.
Sample data :
_id | income | output |
1 | {“cpu”:1000, “mem”:500, “mouse”:“100”} | {“cpu”:1000, “mem”:600 ,“mouse”:“120”} |
2 | {“cpu”:2000, “mem”:1000, “mouse”:“50”,“mainboard”:500 } | {“cpu”:1500, “mem”:300} |
Expected result :
_id | income | output |
1 | 1600 | 1720 |
2 | 3550 | 1800 |
Mongodb script :
var fields = [ "income", "output"];
db.computer.aggregate([
{
$project:{
"values":{
$filter:{
input:{
"\$objectToArray":"$$ROOT"
},
cond:{
$in:[
"$$this.k",
fields
]
}
}
}
}
},
{
\$unwind:"$values"
},
{
$project:{
key:"$values.k",
values:{
"$sum":{
"$let":{
"vars":{
"item":{
"\$objectToArray":"$values.v"
}
},
"in":"$$item.v"
}
}
}
}
},
{$sort: {"_id":-1}},
{ "$group": {
"_id": "$_id",
'income':{"\$first": "$values"},
"output":{"\$last": "$values"}
}},
]);
filter operator stores income field and output field as arrays. $unwind operator split each array into a record. $sum operator calculates sum of members. $group operator groups data by _id and joins up values.
SPL script:
A | |
---|---|
1 | =mongo_open(“mongodb://127.0.0.1:27017/raqdb”) |
2 | =mongo_shell(A1,“computer.find()”).fetch() |
3 | =A2.new(_id:ID,income.array().sum():INCOME,output.array().sum():OUTPUT) |
4 | >A1.close() |
Result:
ID | INCOME | OUTPUT |
1 | 1600.0 | 1720.0 |
2 | 3550.0 | 1800.0 |
Script explanation:
A1: Connect to mongodb database.
A2: Read in data from computer collection.
A3: Split each income value and output value into a sequence and perform sum and join results up with corresponding ID to form a new table sequence.
A4: Close database connection.
SPL gets field values from each subdocument and calculates their sum. This is much simpler. Beside, a nested document looks like a nested array at first sight, so be careful when you write the script.
3. Group and aggregate by segments
Task: Divide the following data by specified intervals of sales and count records in each segment.
Sample data:
_id | NAME | STATE | SALES |
1 | Ashley | New York | 11000 |
2 | Rachel | Montana | 9000 |
3 | Emily | New York | 8800 |
4 | Matthew | Texas | 8000 |
5 | Alexis | Illinois | 14000 |
Grouping conditions: [0,3000);[3000,5000);[5000,7500);[7500,10000);[10000, ∞)
Expected result :
Segment | number |
3 | 3 |
4 | 2 |
MongoDB script:
var a_count=0;
var b_count=0;
var c_count=0;
var d_count=0;
var e_count=0;
db.sales.find({
}).forEach(
function(myDoc) {
if (myDoc.SALES <3000) {
a_count += 1;
}
else if (myDoc.SALES <5000) {
b_count += 1;
}
else if (myDoc.SALES <7500) {
c_count += 1;
}
else if (myDoc.SALES <10000) {
d_count += 1;
}
else {
e_count += 1;
}
}
);
print("a_count="+a_count)
print("b_count="+b_count)
print("c_count="+c_count)
print("d_count="+d_count)
print("e_count="+e_count)
MongoDB lacks related API to handle grouping according to specified intervals and the hardcoding process is roundabout. The above script is just one of the Mongo solutions.
SPL script:
A | |
---|---|
1 | [3000,5000,7500,10000,15000] |
2 | =mongo_open(“mongodb://127.0.0.1:27017/raqdb”) |
3 | =mongo_shell(A2,“sales.find()”).fetch() |
4 | =A3.groups(A1.pseg(int(~.SALES)):Segment;count(1): number) |
5 | >A2.close() |
Script explanation:
A1: Define intervals by which SALES is segmented.
A2: Connect to mongodb database.
A3: Read in data from sales table.
A4: Group records by sales intervals and count records in each group. pseg()function returns the sequence number of a group where a records falls in. int() function converts data into an integer.
A5: Close Mongodb connection.
Both scripts get the desired result, but SPL script is more concise by using pseg() function.
4. Multilevel group and aggregate
Task: Group the following data by addr and then by book and count books for each addr and those for each book type.
addr | book |
address1 | book1 |
address2 | book1 |
address1 | book5 |
address3 | book9 |
address2 | book5 |
address2 | book1 |
address1 | book1 |
address15 | book1 |
address4 | book3 |
address5 | book1 |
address7 | book11 |
address1 | book1 |
Expected result:
_id | Total | books | Count |
address1 | 4 | book1 | 3 |
book5 | 1 | ||
address15 | 1 | book1 | 1 |
address2 | 3 | book1 | 2 |
book5 | 1 | ||
address3 | 1 | book9 | 1 |
address4 | 1 | book3 | 1 |
address5 | 1 | book1 | 1 |
address7 | 1 | book11 | 1 |
MongoDB script :
db.books.aggregate([
{ "$group": {
"_id": {
"addr": "$addr",
"book": "$book"
},
"bookCount": {"\$sum": 1}
}},
{ "$group": {
"_id": "$_id.addr",
"books": {
"$push": {
"book": "$_id.book",
"count": "$bookCount"
},
},
"count": {"\$sum": "$bookCount"}
}},
{"$sort": { "count": -1} },
{ "$project": {
"books": {"\$slice": [ "$books", 2] },
"count": 1
}}
]).pretty()
The above script groups data by addr,book and count books, and then re-group data by addr to count books, and display data in desired format.
SPL script (books.dfx):
A | B | |
---|---|---|
1 | =mongo_open(“mongodb://127.0.0.1:27017/raqdb”) | |
2 | =mongo_shell(A1,“books.find()”) | |
3 | =A2.groups(addr,book;count(book): Count) | |
4 | =A3.groups(addr;sum(Count):Total) | |
5 | =A3.join(addr,A4:addr,Total) | return A5 |
6 | >A1.close() |
Result:
Address | book | Count | Total |
address1 | book1 | 3 | 4 |
address1 | book5 | 1 | 4 |
address15 | book1 | 1 | 1 |
address2 | book1 | 2 | 3 |
address2 | book5 | 1 | 3 |
address3 | book9 | 1 | 1 |
address4 | book3 | 1 | 1 |
address5 | book1 | 1 | 1 |
address7 | book11 | 1 | 1 |
Script explanation:
A1: Connect to mongodb database.
A2: Read in data from books collection.
A3: Group data by addr&book and count books in each group.
A4: Re-group data by addr to count books in each group.
A5: Add A4’s Total field to A3’s table sequence by addr values.
B5: Return A5’s table sequence.
A6: Close database connection.
The above SPL script is both concise and integration-friendly.
To handle the grouping and aggregation in Java, you need to implement the requirement from the low level. The code is complicated and can’t be used to handle other scenarios. esProc offers JDBC interface to enable a Java application to call the SPL script in the manner of calling a stored procedure. See esProc JDBC Basic Uses to learn more.
Here’s how Java calls an SPL script:
public void testStudent (){
Connection con = null;
com.esproc.jdbc.InternalCStatement st;
try{
// Establish a connection
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// Call the stored procedure, where books is the SPL script (with dfx extension) name
st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call books ()");
// Execute the stored procedure
st.execute();
// Get the result set
ResultSet rs = st.getResultSet();
catch(Exception e){
System.out.println(e);
}
It’s easy to use the result of executing an SPL script by a Java application. You can also load the SPL script directly and use its name as a parameter to execute it via a function. esProc supports ODBC, too. The integration via ODBC is similar.
MongoDB’s schemaless document-based structure enables storage convenient and flexible data presentation and query. On the other hand, it increases the complexity of a shell script because there are details to consider and it needs many steps to get to the goal. By contrast, SPL streamlines the process, generates concise and simple code, and increases efficiency.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version