Get a fixed random sample from all documents


(Sebastian Rickelt) #1

Hi,

I want to fetch a large, fixed number of documents at random from
Elasticsearch to compute some statistics (100,000 out of 10 M documents).
The randomness has to be reproducible, so that I get the same documents with
every request.

My problem is that scan and scroll is fast, but as I understand it the order
is not predictable. On the other hand, I could use the 'random_score' function
with a fixed seed in my query. That would fix the order problem, but deep
pagination is very slow. Has anyone done this before? Any ideas or pointers
on how to do this with Elasticsearch?

Any help appreciated.

Cheers,

Sebastian



(Christoph) #2

Hi Sebastian,

I just stumbled on your question. Did you have any luck with your random sampling yet?

If your data has some criterion to filter on that is present in all documents, and that divides the data into chunks small enough to make pagination feasible, you could combine random_score with a fixed seed with a filter on that criterion. Then you can repeatedly sample from the chunks and combine the results. This should be predictable as long as the data stays the same.
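
A minimal sketch of that idea in Python, in case it helps. It only builds the request bodies and merges the hits; the field name `category`, the chunk values, and the `search` callable are all hypothetical placeholders for whatever your client and mapping look like. The query shape (a `function_score` with `random_score` and a `term` filter) is an assumption about your Elasticsearch version; some versions also expect a `field` next to the `seed`, so check the docs for the release you run.

```python
from typing import Any, Callable, Dict, Iterable, List


def build_chunk_query(field: str, value: Any, seed: int, size: int) -> Dict[str, Any]:
    """Build a search body that takes `size` seeded-random docs from one chunk."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # Restrict the query to a single chunk (hypothetical filter field).
                "query": {"bool": {"filter": [{"term": {field: value}}]}},
                # A fixed seed should give the same ordering on unchanged data.
                "functions": [{"random_score": {"seed": seed}}],
                "boost_mode": "replace",
            }
        },
    }


def sample_from_chunks(
    chunk_values: Iterable[Any],
    field: str,
    seed: int,
    per_chunk: int,
    search: Callable[[Dict[str, Any]], List[Any]],
) -> List[Any]:
    """Run one small query per chunk and concatenate the hits.

    `search` is a placeholder: it takes a request body and returns the hits,
    e.g. a thin wrapper around your Elasticsearch client.
    """
    sample: List[Any] = []
    for value in chunk_values:
        body = build_chunk_query(field, value, seed, per_chunk)
        sample.extend(search(body))
    return sample
```

With, say, 100 chunks and `per_chunk=1000`, each request stays shallow while the combined sample reaches 100,000 documents. Note that this takes the same number of documents from every chunk, so if the chunks differ a lot in size you would probably want to scale `per_chunk` to each chunk's document count.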

