Philipp Burckhardt

On Statistics, Programming and the Social Sciences

Queries in mongoDB

At first, I found writing mongo queries quite unintuitive, as one cannot just write a query and execute it as in SQL, but rather has to construct a JavaScript object and pass it to the find function. So in this post I would like to go through the main types of queries one regularly encounters and how they are handled in mongo. We will consider the example of a database consisting of information about movies filmed in the 2000s. Let us open the shell and use the not yet existing database movieDB as follows:

use movieDB

This command changes the scope to the movieDB database, which if not yet existing will be created along the way. The command show collections prints out all the existing collections in this database. As we have not created any collections, this command will not print out anything yet.

Let us now include a few movies into a collection called movie by using the insert command:

db.movie.insert({"name":"Gladiator","director":"Ridley Scott","year":2000})

Let us deconstruct this command: We start with db., then we specify which collection to use, in this case movie, and then we invoke the insert method of this collection. This function expects as an argument the documents which should be inserted into the collection. Documents in mongoDB are JSON-like objects, with a few differences which we are not going to bother with here.
Observe that we want to insert the movie "Gladiator", which was directed by Ridley Scott and released in the year 2000. To populate our database a bit more, we will include three other movies. We can insert multiple documents at once by passing them as an Array to the insert function:

db.movie.insert([{"name":"The Dark Knight","director":"Christopher Nolan","year":2008},
{"name":"The Prestige","director":"Christopher Nolan","year":2006},{"name":"Avatar","director":"James Cameron","year":2009}
])

To retrieve documents from a collection, we use the
db.collection.find method. When invoked without any arguments, it returns all documents:

db.movie.find()

This is equivalent to passing an empty object to the find function, as in db.movie.find({}). As you can see, the returned documents all contain an automatically created \_id field, which serves as the unique identifier for each document:

{ "_id" : ObjectId("5399a4717147a9de6c4ab09f"), "name" : "Gladiator", "director" : "Ridley Scott", "year" : 2000 }

{ "_id" : ObjectId("5399a5677147a9de6c4ab0a0"), "name" : "The Dark Knight", "director" : "Christopher Nolan", "year" : 20
08 }

{ "_id" : ObjectId("5399a5677147a9de6c4ab0a1"), "name" : "The Prestige", "director" : "Christopher Nolan", "year" : 2006}

{ "_id" : ObjectId("5399a5677147a9de6c4ab0a2"), "name" : "Avatar", "director" : "James Cameron", "year" : 2009 }

If we instead only want to retrieve the first result, we can use db.movie.findOne({}) instead. We can get the same result by limiting the result set of find to only one document via the limit function as in

db.movie.find().limit(1)

What if we were intereted in retrieving the document stored at a specific position in the collection? The solution is to append .skip(i) to find, where is any integer, we specify to skip the first i elements and only return the results after them. We can also combine skip and limit. For example,

db.movie.find().limit(2).skip(1)

returns

{ "_id" : ObjectId("5399a5677147a9de6c4ab0a0"), "name" : "The Dark Knight", "director" : "Christopher Nolan", "year" : 20
08 }

{ "_id" : ObjectId("5399a5677147a9de6c4ab0a1"), "name" : "The Prestige", "director" : "Christopher Nolan", "year" : 2006
}

that is it skips the first document in the movie collection and then returns the two ones stored afterwards. In general, the position in the collection is not informative, and documents might even change places when being updated: The reason is that when we append data to a given document, the pre-allocated space at the position of the document might not be sufficient and so the documents would be placed at the end of the collection. To demonstrate this, let us first retrieve the Gladiator movie and append an array of actors. We can assign the result of a findOne query directly to a JavaScript variable:


 var glad = db.movie.findOne();
 glad.actors = ["Russell Crowe",  "Joaquin Phoenix", "Connie Nielsen"];
 

Okay, fine. Inspecting the object, we see that it now indeed contains an Array of the main actors of the movie:

{
        "_id" : ObjectId("5399a4717147a9de6c4ab09f"),
        "name" : "Gladiator",
        "director" : "Ridley Scott",
        "year" : 2000,
        "actors" : [
                "Russell Crowe",
                "Joaquin Phoenix",
                "Connie Nielsen"
        ]
} 

To save a document, one uses db.connection.save. If the document is already present in the collection, its contents are modified and no duplicate is stored:

db.movie.save(glad); 

One gets a message which shows that the change was successful

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 }),

but if in doubt we can just retrieve the document again and check that its contents have indeed changed. Calling
db.movie.findOne(), we get

{
        "_id" : ObjectId("5399a5677147a9de6c4ab0a0"),
        "name" : "The Dark Knight",
        "director" : "Christoper Nolan",
        "year" : 2008
}

Oh, what happened here? The movie at the first position is not Gladiator anymore, but instead The Dark Knight. We have already talked about the reason for this behaviour: The pre-allocated space for each document was not big enough, and so the modified Gladiator movie had to be placed at the end of the collection. Hene, we cannot use the place in the collection as a substitute for an id when retrieving a document as it is not constant. Instead, it is best to use the \_id of the movie if we want to retrieve a specific one. But how do we write a query in mongo, now?

We do this by passing a different object to find than the empty one. For example, to match a given key, we just supply an object which has the same key and the value which we want to search for. For example, to retrieve the movie Gladiator, we could call

db.movie.find({"name":"Gladiator"});

Or we could just look for the movies shot in a specific year, for example by invoking db.movie.find({"year":2008});. If we instead wanted to look for movies from a certain time period, we use the operators $gt, $gte, $lt, $lte which stand for >, >=, < and <=, respectively. Like in the query

db.movie.find({"year":{$gt:2007}})

which retrieves all movies where the year is greater than 2007, in our simple example Avatar and The Dark Knight. We combine different conditions as in

db.movie.find({"year":{$gte:2007,$lte:2009}})

which retrieves documents from the years 2007-2009. When working with strings, one might want to fetch all documents for which a specific key is in a list of values, for example we might want to retrieve all movies directed by Christopher Nolan and Ridley Scott. To do this, we use the $in operator which expects an Array of values which should be searched for. Hence,

db.movie.find({"director":{$in: ["Ridley Scott","Christopher Nolan"]}})

will retrieve the movies Gladiator, The Prestige and The Dark Knight. Remember that actors is an Array of different names. Nonetheless we can query it in the same way: To search for all movies in which Russsell Crowe plays a part, we can do

db.movie.find({"actors":"Russell Crowe"})

This query does return only Gladiator, which in our contrived example is not only the only movie in which he plays a role, but also the only document with actors attached. Hence, as we can see, it is possible in a mongo database to have data of varying form and still be able to perform valid queries.
We can also combine different queries using the $or and $and operators. For example

db.movie.find({$or: [{"year":2009},{"director":"Christopher Nolan"}]})

returns all movies that were directed by Christopher Nolan OR shot in the year 2009. To find all movies directed by him and shot in 2009, we would call db.movie.find({$and: [{"year":2009},{"director":"Christopher Nolan"}]}), which does not return any results for our test database (although this is true in reality, too: His next movie after The Dark Knight, the very good mind-bending experience Inception was only released in 2010).

There is a lot more to formulating queries in mongo, and I have only touched the surface. But the documentation under http://docs.mongodb.org/manual/ is quite extensive, and so always a good resource when trying to figure out how to do a certain query.

In a follow-up post, I will discuss how to formulate queries for more complicated collections where each documents may contain elements which themselves are JSON objects (so called "sub-documents").