Philipp Burckhardt

On Statistics, Programming and the Social Sciences

Sampling from a MongoDB database

Recently, I had to draw a random sample of tweets from a large database holding over 1 TB of tweets, which we have collected for a research project involving sentiment analysis of Twitter data. At this size, even the simplest queries take an enormous amount of time to complete.
In order to gain some insights into the collection and build a sentiment classifier, we decided to draw a random sample of documents from our collection. This is easier said than done, as mongo does not supply any built-in functionality to make this happen. One possibility is to jump to a random offset in the collection and fetch the document found there, repeating this once for every document we want. For a sample of size 500 from the collection tweets, the query could look like this:

var ntotal = db.tweets.count();  
var docs = [];  
for ( var i = 0; i < 500; i++ ) {  
    var random = Math.floor( Math.random() * ntotal );
    var doc = db.tweets
      .find()
      .skip( random )
      .limit( 1 )
      .next();
    docs.push( doc );
}

However, this approach does not scale: skip( k ) has to walk past k documents before it can return one, so each iteration of the loop scans on average half of the collection. Given the size of our database, we need to look for a different approach. One idea is to attach a uniform random number between 0 and 1 to each document and to query on that number instead. We can attach such a random number to each document by executing the following command:

// Tag every tweet with a uniform random number in [0, 1).
db.tweets.find().forEach( function( doc ) {
  db.tweets.update( { _id: doc._id }, { $set: { random: Math.random() } } );
});
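On a collection of this size, one update round trip per document can itself take a long time. As a hedged alternative, here is a minimal sketch that groups the updates with the shell's bulk operations API. The function name attachRandomInBatches and the batchSize parameter are assumptions made for illustration; coll stands for a shell collection such as db.tweets.

```javascript
// Sketch only: tag every document with a random number, sending the
// updates to the server in batches instead of one at a time.
// `attachRandomInBatches` and `batchSize` are hypothetical names.
function attachRandomInBatches( coll, batchSize ) {
  var bulk = coll.initializeUnorderedBulkOp();
  var pending = 0; // updates queued in the current batch
  var total = 0;   // updates queued overall

  // Only the _id is needed to address each document.
  coll.find( {}, { _id: 1 } ).forEach( function( doc ) {
    bulk.find( { _id: doc._id } )
      .updateOne( { $set: { random: Math.random() } } );
    pending++;
    total++;
    if ( pending === batchSize ) {
      bulk.execute();
      bulk = coll.initializeUnorderedBulkOp();
      pending = 0;
    }
  });

  // Flush the last, partially filled batch; skip it if nothing is queued.
  if ( pending > 0 ) {
    bulk.execute();
  }
  return total;
}
```

In the shell this would be called as attachRandomInBatches( db.tweets, 1000 ); the batch size of 1000 is an arbitrary choice, not a tuned value.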

To draw a random sample of size n, we can now draw another uniform random number and return the n documents whose random values are the smallest ones above it.
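To make the idea concrete without touching the database, here is a minimal in-memory sketch of the same scheme. The helper names tagWithRandom and sampleAbove, and the toy documents with an id field, are made up for illustration; the real collection is queried through mongo further down.

```javascript
// In-memory illustration of the random-key idea (no MongoDB involved).
// `tagWithRandom` and `sampleAbove` are hypothetical helper names.

// Attach a uniform random number in [0, 1) to every document.
function tagWithRandom( docs ) {
  return docs.map( function( doc ) {
    return { id: doc.id, random: Math.random() };
  });
}

// Return the n documents whose random values are the smallest ones
// strictly above `start`.
function sampleAbove( tagged, start, n ) {
  return tagged
    .filter( function( doc ) { return doc.random > start; } )
    .sort( function( a, b ) { return a.random - b.random; } )
    .slice( 0, n );
}

// Toy collection of 1000 documents.
var toyDocs = [];
for ( var i = 0; i < 1000; i++ ) {
  toyDocs.push( { id: i } );
}

var tagged = tagWithRandom( toyDocs );
var toySample = sampleAbove( tagged, Math.random(), 5 );
console.log( toySample );
```

Because the random values are independent of the document contents, the documents just above any threshold form a random subset. The sketch also shows a limitation: when the threshold lands close to 1, fewer than n documents may lie above it.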

But first, in order to speed things up, we should index the collection on the newly created random field:

// createIndex replaces ensureIndex, which newer MongoDB versions deprecate.
db.tweets.createIndex( { random: 1 } );

At last, we are in a position to sample the tweets. The following commands print the sampled documents as a JSON array:

var sampleSize = 500;  
var start = Math.random();  
var docs = db.tweets
  .find( { random: {$gt: start} } )
  .sort( { random: 1 } )
  .limit( sampleSize );

printjson( docs.toArray() );  

Notice that we sort the cursor returned by find in increasing order of random before limiting it. Without the sort, MongoDB makes no promise about the order of the matching documents, so you could simply end up with the first documents in the collection whose random value is larger than start, an undesirable outcome. Chaining .limit( sampleSize ) then caps the number of retrieved documents.
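One caveat of this query is that it can return fewer than sampleSize documents when start happens to land close to 1. A possible remedy, sketched below under my own assumptions, is to wrap around and fill the gap with the documents that have the smallest random values. The function name sampleWithWrapAround is hypothetical, and coll stands for a shell collection such as db.tweets.

```javascript
// Sketch only: take up to `sampleSize` documents just above `start` and,
// if the upper range runs out, continue from the smallest random values.
// `sampleWithWrapAround` is a hypothetical helper, not a MongoDB function.
function sampleWithWrapAround( coll, sampleSize, start ) {
  var docs = coll
    .find( { random: { $gt: start } } )
    .sort( { random: 1 } )
    .limit( sampleSize )
    .toArray();

  var missing = sampleSize - docs.length;
  // Guard against limit( 0 ), which MongoDB treats as "no limit".
  if ( missing > 0 ) {
    var wrapped = coll
      .find( { random: { $lte: start } } )
      .sort( { random: 1 } )
      .limit( missing )
      .toArray();
    docs = docs.concat( wrapped );
  }
  return docs;
}
```

In the shell, printjson( sampleWithWrapAround( db.tweets, 500, Math.random() ) ) would take the place of the query above. The two ranges do not overlap, so no document is returned twice.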

If the file containing the sampling code were called sample.js, we could execute it and redirect the results to the file sample.json via the command

mongo --quiet sample.js > sample.json