Getting Top N from MongoDB Data
MongoDB is a popular NoSQL database that can store a huge volume of data. There is a common need in statistical projects: sort a large number of objects and only get the top N, such as getting the latest values of a target field. In this article we’ll illustrate how to handle this in SPL.
Collection last3 has two fields: variable and timestamp. Task: group documents by variable, get documents with the latest 3 timestamp, and then find the document with the earliest timestamp.
Here’s part of the source data:
{"_id" : ObjectId("54f69645e4b077ed8d997857"),"variable" : "A", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997856"),"variable" : "A", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997855"),"variable" : "A", "timestamp" : ISODate("1995-01-03T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997854"),"variable" : "B", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997853"),"variable" : "B", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997852"),"variable" : "B", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997851"),"variable" : "C", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997850"),"variable" : "C", "timestamp" : ISODate("1994-01-02T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997858"),"variable" : "C", "timestamp" : ISODate("1994-01-01T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997859"),"variable" : "C", "timestamp" : ISODate("1993-01-01T00:00:00Z")}
SPL script:
A | B | |
---|---|---|
1 | =mongo_open(“mongodb://localhost:27017/local?user=test&password=test”) | |
2 | =mongo_shell(A1,“last3.find(,{_id:0};{variable:1})”) | |
3 | for A2;variable | =A3.top(3;-timestamp) |
4 | =@|B3 | |
5 | =B4.minp(~.timestamp) | |
6 | =mongo_close(A1) |
A1: Connect to MongoDB database. The connection string’s format is mongo://ip:port/db?arg=value&….
A2: find() function retrieves documents from collection last3, sorts them and return result as a cursor. As no filtering condition is specified, it gets all fields except _id and sorts them by variable.
A3: Fetch a set of documents with same variable field value from A2’s cursor each time and do this iteratively. A3’s working scope covers an indented area from B3 to B4, within which A3 references a loop variable. A3 stores data in the memory and we can view a fetch result in debugging mode:
variable | timestamp |
C | 1994-01-03 08:00:00 |
C | 1994-01-02 08:00:00 |
C | 1994-01-01 08:00:00 |
C | 1993-01-01 08:00:00 |
B3: Select documents with the latest (largest) timestamp.
B4: Append B3’s result to B4. Here’s B4’’s result:
variable | timestamp |
A | 1995-01-03 08:00:00 |
A | 1995-01-02 08:00:00 |
A | 1995-01-01 08:00:00 |
B | 1995-01-02 08:00:00 |
B | 1995-01-01 08:00:00 |
B | 1994-01-03 08:00:00 |
C | 1994-01-03 08:00:00 |
C | 1994-01-02 08:00:00 |
C | 1994-01-01 08:00:00 |
A5: Find the document with the earliest (smallest) timestamp from B4:
variable | timestamp |
C | 1994-01-01 08:00:00 |
A6: Close MongoDB connection.
By simplifying the implementation of mongodb shell, it’s much easier to get topN from MongoDB data in SPL.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version