"MongoDB is a popular NoSQL database that can store a huge volume of data. There is a common need .."

Hoo RaqForum 19 No.
1 Reply • 781 View • 5 Years ago

Getting Top N from MongoDB Data

MongoDB is a popular NoSQL database that can store a huge volume of data. There is a common need in statistical projects: sort a large number of objects and only get the top N, such as getting the latest values of a target field. In this article we’ll illustrate how to handle this in SPL.
Collection last3 has two fields: variable and timestamp. Task: group documents by variable, get documents with the latest 3 timestamp, and then find the document with the earliest timestamp.
Here’s part of the source data:

{"_id" :  ObjectId("54f69645e4b077ed8d997857"),"variable" :  "A", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997856"),"variable" :  "A", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997855"),"variable" :  "A", "timestamp" : ISODate("1995-01-03T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997854"),"variable" :  "B", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997853"),"variable" :  "B", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997852"),"variable" :  "B", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997851"),"variable" :  "C", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997850"),"variable" :  "C", "timestamp" : ISODate("1994-01-02T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997858"),"variable" :  "C", "timestamp" : ISODate("1994-01-01T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997859"),"variable" :  "C", "timestamp" : ISODate("1993-01-01T00:00:00Z")}

SPL script:

	A	B
1	=mongo_open(“mongodb://localhost:27017/local?user=test&password=test”)
2	=mongo_shell(A1,“last3.find(,{_id:0};{variable:1})”)
3	for A2;variable	=A3.top(3;-timestamp)
4		=@\|B3
5	=B4.minp(~.timestamp)
6	=mongo_close(A1)

A1: Connect to MongoDB database. The connection string’s format is mongo://ip:port/db?arg=value&….
A2: find() function retrieves documents from collection last3, sorts them and return result as a cursor. As no filtering condition is specified, it gets all fields except _id and sorts them by variable.
A3: Fetch a set of documents with same variable field value from A2’s cursor each time and do this iteratively. A3’s working scope covers an indented area from B3 to B4, within which A3 references a loop variable. A3 stores data in the memory and we can view a fetch result in debugging mode:

variable	timestamp
C	1994-01-03 08:00:00
C	1994-01-02 08:00:00
C	1994-01-01 08:00:00
C	1993-01-01 08:00:00

B3: Select documents with the latest (largest) timestamp.
B4: Append B3’s result to B4. Here’s B4’’s result:

variable	timestamp
A	1995-01-03 08:00:00
A	1995-01-02 08:00:00
A	1995-01-01 08:00:00
B	1995-01-02 08:00:00
B	1995-01-01 08:00:00
B	1994-01-03 08:00:00
C	1994-01-03 08:00:00
C	1994-01-02 08:00:00
C	1994-01-01 08:00:00

A5: Find the document with the earliest (smallest) timestamp from B4:

variable	timestamp
C	1994-01-01 08:00:00

A6: Close MongoDB connection.
By simplifying the implementation of mongodb shell, it’s much easier to get topN from MongoDB data in SPL.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Hoo • 781 View • 5 Years ago