Get a fixed random sample from all documents

Sebastian_Rickelt · April 24, 2015, 2:02pm

Hi,

I want to fetch a large, fixed-size random sample of documents from
Elasticsearch to compute some statistics (100,000 out of 10 M documents).
The sample has to be reproducible, so that every request returns the same
documents.

My problem is that scan and scroll is fast, but as I understand it the order
is not predictable. On the other hand, I could use the 'random_score' function
with a fixed seed in my query. That would fix the ordering problem, but deep
pagination is very slow. Has anyone done this before? Any ideas or pointers
on how to do this with Elasticsearch?

Any help appreciated.

Cheers,

Sebastian

cbuescher · October 16, 2015, 11:53am

Hi Sebastian,

I just stumbled on your question. Did you have any luck with your random sampling yet? If every document has a field you can filter on, and that field divides the data into chunks small enough to make pagination feasible, you could combine random_score (with a fixed seed) with a filter on that field. You can then sample from each chunk in turn and combine the results. The sample should be reproducible as long as the data stays the same.
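
To make the idea a bit more concrete, here is a rough Python sketch of the chunked approach. It only builds the request bodies and merges the hits; it does not talk to a cluster itself. The field name `region`, its values, and the `search` callable are hypothetical placeholders, and the `"field": "_seq_no"` entry inside `random_score` is an assumption that applies to newer Elasticsearch versions (older ones accepted a seed on its own), so please adapt it to your version.

```python
from typing import Any, Callable, Dict, Iterable, List


def build_chunk_query(field: str, value: Any, seed: int, size: int) -> Dict[str, Any]:
    """Build a search body that ranks one chunk by a seeded random score.

    `field` and `value` select the chunk (hypothetical names, supplied by the caller).
    """
    return {
        "size": size,
        "query": {
            "function_score": {
                # Restrict the sample to a single chunk.
                "query": {"bool": {"filter": [{"term": {field: value}}]}},
                # Fixed seed -> same ordering while the data is unchanged.
                # Assumption: newer versions expect a field next to the seed.
                "random_score": {"seed": seed, "field": "_seq_no"},
                # Use only the random score for ranking.
                "boost_mode": "replace",
            }
        },
    }


def sample_all(
    search: Callable[[Dict[str, Any]], List[str]],
    field: str,
    values: Iterable[Any],
    seed: int,
    per_chunk: int,
) -> List[str]:
    """Take the top `per_chunk` hits from every chunk and concatenate them.

    `search` is a hypothetical callable that sends a body to Elasticsearch
    and returns the matching document ids in ranked order.
    """
    sample: List[str] = []
    for value in values:
        sample.extend(search(build_chunk_query(field, value, seed, per_chunk)))
    return sample
```

In practice `search` would wrap whatever client you use, and `per_chunk` should stay small enough that each chunk can be fetched without deep pagination. Taking the same number of documents from every chunk only gives a uniform sample if the chunks are of similar size; otherwise the per-chunk count would have to be scaled to the chunk size.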
