I am looking to get a random record from a huge collection (100 million records). What is the fastest and most efficient way to do so? The data is already there, and there is no field on which I can generate a random number to obtain a random row.
asked May 13, 2010 at 2:43
See also this SO question titled "Ordering a result set randomly in mongo". Thinking about randomly ordering a result set is a more general version of this question -- more powerful and more useful.
Commented Jun 15, 2012 at 20:30
This question keeps popping up. The latest information can likely be found at the feature request to get random items from a collection in the MongoDB ticket tracker. If implemented natively, it would likely be the most efficient option. (If you want the feature, go vote it up.)
Commented Jun 17, 2012 at 2:37
Is this a sharded collection? Commented Jul 27, 2013 at 17:51
Does anyone know how much slower this is than just taking the first record? I'm debating whether it's worth taking a random sample to do something vs just doing it in order.
Commented Feb 6, 2020 at 15:00
Actually, contrary to the other answers, $sample might not be the fastest solution, because Mongo may do a collection scan for random sorting when using $sample, depending on the situation (see docs.mongodb.com/manual/reference/operator/aggregation/sample). Counting the result set and doing a random skip/take may do better.
Commented Dec 5, 2020 at 7:51
Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:
// Get one random document from the mycoll collection.
db.mycoll.aggregate([ { $sample: { size: 1 } } ])
If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:
// Get one random document matching a filter from the mycoll collection.
db.mycoll.aggregate([
   { $match: { a: 10 } },
   { $sample: { size: 1 } }
])
As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
answered Nov 7, 2015 at 2:28
This is a good way, but remember that it does NOT guarantee that there are no copies of the same object in the sample.
Commented Jan 6, 2016 at 1:28
@MatheusAraujo which won't matter if you want one record, but good point anyway Commented Jan 10, 2016 at 3:35
Not to be pedantic, but the question doesn't specify a MongoDB version, so I'd assume having the most recent version is reasonable.
Commented Apr 7, 2016 at 17:35
@Nepoxx See the docs regarding the processing involved. Commented Jun 7, 2016 at 13:32
@brycejl That would have the fatal flaw of not matching anything if the $sample stage didn't select any matching documents.
Commented Apr 19, 2020 at 0:21
Do a count of all records, generate a random number between 0 and the count, and then do:
db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
answered May 13, 2010 at 2:48
Unfortunately skip() is rather inefficient since it has to scan that many documents. Also, there is a race condition if rows are removed between getting the count and running the query.
Commented May 17, 2010 at 18:49
Note that the random number should be between 0 and the count (exclusive). I.e., if you have 10 items, the random number should be between 0 and 9. Otherwise the cursor could try to skip past the last item, and nothing would be returned.
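For anyone unsure about that exclusive upper bound, here is a quick plain-Python sanity check (in-memory, no MongoDB needed; `count` is a hypothetical stand-in for the collection count):

```python
import random

count = 10  # stand-in for db.yourCollection.count()
# randrange(count) yields 0 .. count-1 (exclusive upper bound),
# so skip() can never run past the last document
skips = [random.randrange(count) for _ in range(1000)]
assert all(0 <= s < count for s in skips)
```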
Commented Apr 20, 2011 at 22:05
Thanks, worked perfectly for my purposes. @mstearn, your comments on both efficiency and race conditions are valid, but for collections where neither matters (one-time server-side batch extract in a collection where records aren't deleted), this is vastly superior to the hacky (IMO) solution in the Mongo Cookbook.
Commented Sep 5, 2012 at 16:27
what does setting the limit to -1 do? Commented Jan 27, 2013 at 12:46
@MonkeyBonkey docs.mongodb.org/meta-driver/latest/legacy/… "If numberToReturn is 0, the db will use the default return size. If the number is negative, then the database will return that number and close the cursor."
Commented Jan 27, 2013 at 15:24
3.2 introduced $sample to the aggregation pipeline.
There's also a good blog post on putting it into practice.
This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."
The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/
To paraphrase the recipe, you assign random numbers to your documents:
db.docs.save( { key : 1, ..., random : Math.random() } )
Then select a random document:
rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
  result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}
Querying with both $gte and $lte is necessary to find the document with a random number nearest to rand.
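The two-query fallback can be sketched in plain Python against an in-memory stand-in for the collection (names hypothetical; `find_one_random` mimics the index-ordered findOne):

```python
import random

# in-memory stand-in for the collection; "random" is the stored key
docs = [{"key": 2, "random": random.random()} for _ in range(100)]

def find_one_random(docs):
    rand = random.random()
    # $gte pass: the smallest stored key >= rand (index order) ...
    ge = [d for d in docs if d["random"] >= rand]
    if ge:
        return min(ge, key=lambda d: d["random"])
    # ... falling back to the $lte pass: the largest key <= rand
    le = [d for d in docs if d["random"] <= rand]
    return max(le, key=lambda d: d["random"])

assert find_one_random(docs) in docs
```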
And of course you'll want to index on the random field:
db.docs.ensureIndex( { key : 1, random : 1 } )
If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.
answered Apr 1, 2011 at 18:17
And here is a simple way to add the random field to every document in the collection:

function setRandom() {
    db.topics.find().forEach(function (obj) {
        obj.random = Math.random();
        db.topics.save(obj);
    });
}
This selects a document randomly, but if you do it more than once, the lookups are not independent. You are more likely to get the same document twice in a row than random chance would dictate.
Commented Jan 10, 2012 at 2:19
Looks like a bad implementation of circular hashing. It's even worse than lacker says: even one lookup is biased because the random numbers aren't evenly distributed. To do this properly, you'd need a set of, say, 10 random numbers per document. The more random numbers you use per document, the more uniform the output distribution becomes.
Commented Mar 29, 2012 at 21:11
The MongoDB JIRA ticket is still alive: jira.mongodb.org/browse/SERVER-533 Go comment and vote if you want the feature.
Commented Jun 15, 2012 at 20:32
Take note of the type of caveat mentioned. This does not work efficiently with a small number of documents. Given two items with random keys of 3 and 63, document #63 will be chosen more frequently because $gte is first. The alternative solution stackoverflow.com/a/9499484/79201 would work better in this case.
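The bias described in the comment above is easy to reproduce in plain Python with two stand-in documents whose stored keys are 0.03 and 0.63 (values hypothetical):

```python
import random

# two documents whose stored random keys happen to be 0.03 and 0.63
keys = [0.03, 0.63]
hits = {k: 0 for k in keys}
for _ in range(100_000):
    r = random.random()
    ge = [k for k in keys if k >= r]                            # the $gte pass
    chosen = min(ge) if ge else max(k for k in keys if k <= r)  # $lte fallback
    hits[chosen] += 1
# 0.63 wins whenever r > 0.03, i.e. about 97% of the time
assert hits[0.63] > hits[0.03]
```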
Commented Oct 30, 2013 at 15:50
You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.
First, enable geospatial indexing on a collection:
db.docs.ensureIndex( { random_point: '2d' } )
To create a bunch of documents with random points on the X-axis:
for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}
Then you can get a random document from the collection like this:
db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )
Or you can retrieve several documents nearest to a random point:
db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
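A rough one-dimensional stand-in for the $near lookup, in plain Python (no geospatial index, just nearest-value search; all names hypothetical):

```python
import random

# stand-in for documents with random points on the X-axis
points = [random.random() for _ in range(100)]

def find_one_near(points):
    target = random.random()
    # $near returns documents ordered by distance; take the closest one
    return min(points, key=lambda p: abs(p - target))

assert find_one_near(points) in points
```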
answered Feb 29, 2012 at 12:50 Nico de Poel
I like this answer, it's the most efficient one I've seen that doesn't require a bunch of messing about server side.
Commented Mar 10, 2012 at 17:58
This is also biased towards documents that happen to have few points in their vicinity. Commented Mar 29, 2012 at 21:13
That is true, and there are other problems as well: documents are strongly correlated on their random keys, so it's highly predictable which documents will be returned as a group if you select multiple documents. Also, documents close to the bounds (0 and 1) are less likely to be chosen. The latter could be solved by using spherical geomapping, which wraps around at the edges. However, you should see this answer as an improved version of the cookbook recipe, not as a perfect random selection mechanism. It's random enough for most purposes.
Commented Mar 30, 2012 at 11:51
@NicodePoel, I like your answer as well as your comment! And I have a couple of questions for you: 1- How do you know that points close to the bounds 0 and 1 are less likely to be chosen, is that based on some mathematical ground? 2- Can you elaborate more on spherical geomapping, how it will better the random selection, and how to do it in MongoDB? Appreciated!
Commented Sep 10, 2015 at 12:47
Appreciate your idea. Finally, I have great code that is much more CPU & RAM friendly! Thank you Commented Mar 3, 2020 at 22:49
The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less evenly distributed than the skip(random) solution, but much faster and more fail-safe in case documents are removed.
function draw(collection, query) {
    // query: mongodb query object (optional)
    var query = query || {};
    query['random'] = { $lte: Math.random() };
    var cur = collection.find(query).sort({ random: -1 });
    if (!cur.hasNext()) {
        delete query.random;
        cur = collection.find(query).sort({ random: -1 });
    }
    var doc = cur.next();
    doc.random = Math.random();
    collection.update({ _id: doc._id }, doc);
    return doc;
}
It also requires you to add a "random" field to your documents, so don't forget to add it when you create them; you may need to initialize your collection as shown by Geoffrey:
function addRandom(collection) {
    collection.find().forEach(function (obj) {
        obj.random = Math.random();
        collection.save(obj);
    });
}
db.eval(addRandom, db.things);
Benchmark results
This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:
For a collection with 1,000,000 elements:
The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.
This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.
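A plain-Python sketch of the draw-and-rerandomize idea above, using an in-memory list as a stand-in for the collection (names hypothetical):

```python
import random

# each "document" carries a random key that is re-drawn after every
# pick, so repeated draws stay fair over time
docs = [{"_id": i, "random": random.random()} for i in range(50)]

def draw(docs):
    r = random.random()
    # like find({random: {$lte: r}}).sort({random: -1}): largest key <= r
    candidates = [d for d in docs if d["random"] <= r]
    pool = candidates if candidates else docs  # fallback: drop the filter
    doc = max(pool, key=lambda d: d["random"])
    doc["random"] = random.random()  # the update step of the recipe
    return doc

assert draw(docs) in docs
```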
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between them.
// 4 bytes from a hex string is 8 characters.
var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;

// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000");

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
    .sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value, since that is what we are looking for. Using integers as the _id value is essentially simpler, but the same basic idea applies.
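The hex-padding step can be sanity-checked in plain Python (the timestamp value is hypothetical):

```python
# the first 4 bytes (8 hex chars) of an ObjectId are its creation time
# in seconds, so a seconds value padded with 16 zero hex chars forms a
# valid 24-char ObjectId boundary for a $gte range query
def objectid_floor(seconds):
    return format(seconds, "08x") + "0" * 16

oid = objectid_floor(1435316766)  # hypothetical timestamp
assert len(oid) == 24
assert oid.endswith("0" * 16)
```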
answered Jun 26, 2015 at 11:06 Blakes Seven
I have a collection of 300,000,000 lines. This is the only solution that works, and it's fast enough. Commented Apr 14, 2019 at 6:51
Now you can use the aggregate. Example:
db.users.aggregate( [ { $sample: { size: 3 } } ] )
answered Feb 6, 2017 at 17:00
Note: $sample may get the same document more than once
Commented May 29, 2017 at 4:46
In Python using pymongo:
import random

def get_random_doc():
    count = collection.count()
    return collection.find()[random.randrange(count)]
answered Jan 24, 2015 at 14:38
Worth noting that internally, this will use skip and limit, just like many of the other answers.
Commented Jan 24, 2015 at 15:07
Your answer is correct. However, please replace count() with estimated_document_count(), as count() is deprecated in MongoDB v4.2.
Commented Jun 11, 2020 at 23:50
Using Python (pymongo), the aggregate function also works.
collection.aggregate([{ "$sample": { "size": 1 } }])
This approach is a lot faster than running a query for a random number (e.g. collection.find()[random_int]). This is especially the case for large collections.
answered Apr 17, 2018 at 14:37
It is tough if there is no data there to key off of. What are the _id fields? Are they MongoDB ObjectIds? If so, you could get the highest and lowest values:
lowest = db.coll.find().sort({ _id: 1 }).limit(1).next()._id;
highest = db.coll.find().sort({ _id: -1 }).limit(1).next()._id;
Then, if you assume the ids are uniformly distributed (they aren't, but at least it's a start):
unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)

V = (H - L) * random_from_0_to_1();
N = L + V;

oid = N concat random_4_bytes();

randomobj = db.coll.find({ _id: { $gte: oid } }).limit(1);
answered May 13, 2010 at 13:48
Any ideas how that would look in PHP? Or at least what language you used above? Is it Python?
Commented May 20, 2013 at 18:03
You can pick a random timestamp and search for the first object that was created afterwards. It will only scan a single document, though it doesn't necessarily give you a uniform distribution.
var randRec = function() {
    // replace with your collection
    var coll = db.collection;
    // get unixtime of first and last record
    var min = coll.find().sort({ _id: 1 }).limit(1)[0]._id.getTimestamp() - 0;
    var max = coll.find().sort({ _id: -1 }).limit(1)[0]._id.getTimestamp() - 0;

    // allow to pass additional query params
    return function(query) {
        if (typeof query === 'undefined') query = {};
        var randTime = Math.round(Math.random() * (max - min)) + min;
        var hexSeconds = Math.floor(randTime / 1000).toString(16);
        var id = ObjectId(hexSeconds + "0000000000000000");
        query._id = { $gte: id };
        return coll.find(query).limit(1);
    };
}();
answered Dec 4, 2014 at 23:37
Martin Nowak
It would be easily possible to skew the random date to account for superlinear database growth.
Commented Mar 31, 2015 at 18:20
this is the best method for very large collections; it works in O(1), unlike skip() or count() used in the other solutions here
Commented Nov 2, 2016 at 9:04My solution on php:
/**
 * Get random docs from Mongo
 * @param $collection
 * @param $where
 * @param $fields
 * @param $limit
 * @author happy-code
 * @url happy-code.com
 */
private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {

    // Total docs
    $count = $collection->find($where, $fields)->count();

    if (!$limit) {
        // Get all docs
        $limit = $count;
    }

    $data = array();
    for ($i = 0; $i < $limit; $i++) {

        // Skip documents
        $skip = rand(0, ($count - 1));
        if ($skip !== 0) {
            $doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
        } else {
            $doc = $collection->find($where, $fields)->limit(1)->getNext();
        }

        if (is_array($doc)) {
            // Catch document
            $data[ (string) $doc['_id'] ] = $doc;

            // Ignore current document when making the next iteration
            $where['_id']['$nin'][] = $doc['_id'];
        }

        // Every iteration catches a document and decreases the total number of documents
        $count--;
    }

    return $data;
}
answered Dec 23, 2014 at 17:29
code_turist
The best way in Mongoose is to make an aggregation call with $sample. However, Mongoose does not apply Mongoose documents to aggregation results, especially not if populate() is to be applied as well.
For getting a "lean" array from the database:
/* Sample model should be init first
const Sample = mongoose … */

const samples = await Sample.aggregate([
  { $match: {} },
  { $sample: { size: 33 } },
]).exec();
console.log(samples); // a lean Array
For getting an array of mongoose documents:
const samples = (
  await Sample.aggregate([
    { $match: {} },
    { $sample: { size: 27 } },
    { $project: { _id: 1 } },
  ]).exec()
).map(v => v._id);

const mongooseSamples = await Sample.find({ _id: { $in: samples } });

console.log(mongooseSamples); // an Array of mongoose documents
answered Apr 6, 2021 at 9:21
How to bring only certain fields and not the whole record?
Commented Apr 28, 2022 at 23:26
Commented May 2, 2022 at 15:10
In order to get a given number of random docs without duplicates:
number_of_docs = 7;
db.collection('preguntas').find({}).toArray(function(err, arr) {
    count = arr.length;
    idsram = [];
    rans = [];
    while (number_of_docs != 0) {
        var R = Math.floor(Math.random() * count);
        if (rans.indexOf(R) > -1) {
            continue;
        } else {
            rans.push(R);
            idsram.push(arr[R]._id);
            number_of_docs--;
        }
    }
    db.collection('preguntas').find({ _id: { $in: idsram } }).toArray(function(err1, doc1) {
        if (err1) {
            console.log(err1);
            return;
        }
        res.send(doc1);
    });
});
answered Dec 19, 2015 at 20:13
Fabio Guerra
My simplest solution to this:
db.coll.find() .limit(1) .skip(Math.floor(Math.random() * 500)) .next()
where you have at least 500 items in the collection.
answered Sep 22, 2022 at 3:26 Irfan Habib
I would suggest using map/reduce, where you use the map function to only emit when a random value is above a given probability.
function mapf() {
    if (Math.random() <= probability) {
        emit(1, this);
    }
}

function reducef(key, values) {
    return { "documents": values };
}

res = db.questions.mapReduce(mapf, reducef,
    { "out": { "inline": 1 }, "scope": { "probability": 0.5 } });
printjson(res.results);
The reducef function above works because only one key ('1') is emitted from the map function.
The value of "probability" is defined in the "scope" when invoking mapReduce(...).
Using mapReduce like this should also be usable on a sharded db.
If you want to select exactly n of m documents from the db, you could do it like this:
function mapf() {
    if (countSubset == 0) return;
    var prob = countSubset / countTotal;
    if (Math.random() <= prob) {
        emit(1, { "documents": [this] });
        countSubset--;
    }
    countTotal--;
}

function reducef(key, values) {
    var newArray = new Array();
    for (var i = 0; i < values.length; i++) {
        newArray = newArray.concat(values[i].documents);
    }
    return { "documents": newArray };
}

res = db.questions.mapReduce(mapf, reducef,
    { "out": { "inline": 1 }, "scope": { "countTotal": 1000, "countSubset": 5 } });
printjson(res.results);
Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.
This approach might give some problems on sharded databases.
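The map function above is essentially one-pass selection sampling (Knuth's Algorithm S); a plain-Python sketch of the same logic, run against an in-memory list:

```python
import random

def sample_n_of_m(docs, n):
    # one-pass selection sampling: keep each doc with probability
    # (still needed) / (still remaining), which yields exactly n picks
    count_total = len(docs)
    count_subset = n
    picked = []
    for doc in docs:
        if count_subset == 0:
            break
        if random.random() <= count_subset / count_total:
            picked.append(doc)
            count_subset -= 1
        count_total -= 1
    return picked

assert len(sample_n_of_m(list(range(1000)), 5)) == 5
```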
answered Feb 26, 2012 at 13:43
Doing a full collection scan to return 1 element must be the least efficient technique to do it.
Commented Mar 29, 2012 at 21:14
The trick is that it is a general solution for returning an arbitrary number of random elements; in that case it would be faster than the other solutions when getting > 2 random elements.
Commented Feb 6, 2014 at 10:52
You can pick a random _id and return the corresponding object:
db.collection.count( function(err, count) {
    db.collection.distinct( "_id", function(err, result) {
        if (err) res.send(err);
        var randomId = result[Math.floor(Math.random() * (count - 1))];
        db.collection.findOne( { _id: randomId }, function(err, result) {
            if (err) res.send(err);
            console.log(result);
        });
    });
});
Here you don't need to spend space on storing random numbers in the collection.
answered Apr 30, 2015 at 4:24
The following aggregation operation randomly selects 3 documents from the collection:

db.collection.aggregate([ { $sample: { size: 3 } } ])
answered Oct 16, 2020 at 9:09 Anup Panwar
MongoDB now has $rand
To pick n non-repeating items, aggregate with { $addFields: { _f: { $rand: {} } } }, then $sort by _f and $limit n.
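A plain-Python sketch of why the $rand-then-$sort approach cannot return duplicates (in-memory stand-in for the collection; names hypothetical):

```python
import random

docs = [{"_id": i} for i in range(100)]
n = 5
# mirror of: {$addFields: {_f: {$rand: {}}}}, then {$sort: {_f: 1}}, then {$limit: n}
tagged = [dict(d, _f=random.random()) for d in docs]
picked = sorted(tagged, key=lambda d: d["_f"])[:n]
assert len(picked) == n
assert len({d["_id"] for d in picked}) == n  # each doc appears at most once
```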
answered Feb 23, 2021 at 15:38
any example plz? Commented Nov 24, 2021 at 8:13
I'd suggest adding a random int field to each object. Then you can just do a
findOne({ random_field: { $gte: rand } })
to pick a random document. Just make sure you ensureIndex() the random field.
answered May 17, 2010 at 18:47
If the first record in your collection has a relatively high random_field value, won't it be returned almost all the time?
Commented Jan 23, 2013 at 23:03
thehaitus is correct, it will -- it is not suitable for any purpose Commented Aug 7, 2013 at 21:54
This solution is completely wrong: adding a random number (let's imagine between 0 and 2^32-1) doesn't guarantee any good distribution, and using $gte makes it even worse, since your random selection won't be even close to a pseudo-random number. I suggest never using this concept.
Commented Dec 2, 2013 at 20:32
When I was faced with a similar solution, I backtracked and found that the business request was actually for creating some form of rotation of the inventory being presented. In that case, there are much better options, with answers from search engines like Solr, not data stores like MongoDB.
In short, with the requirement to "intelligently rotate" content, what we should do instead of a random number across all of the documents is to include a personal q-score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last-seen date, and whatever other factors the business finds meaningful for computing a q-score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, apply the q-score modifier, and take the number of records requested by the end user. Then randomize that page of results; since it is a tiny set, simply sort the documents in the application layer (in memory).
If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than user.
If the universe of products is small enough, you can create an index per user.
I have found this technique to be much more efficient, but more importantly more effective in creating a relevant, worthwhile experience of using the software solution.