I want to copy a sample of some production Elasticsearch indexes into a test cluster. I can use Logstash to do the copy, and I know I can sample inside Logstash by dropping roughly 99% of events with something like:
filter { ruby { code => "event.cancel if rand <= 0.99" } }
However, I would like to do the filtering in Elasticsearch, so that Logstash never sees the records it is going to drop. If the indexes were small, I could use random_score together with size to take a random top-N, but my understanding is that this will not scale to asking for the top 10,000,000 documents. Is there another way?
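
For reference, here is a rough sketch of the random_score plus size approach I mean, written as a Python helper that only builds the search body. The function name, the seed value, and the use of the _seq_no field for seeding are my own assumptions for illustration, not something I have tuned. As far as I know, a single search is capped at 10,000 hits by default (index.max_result_window), which is part of why I doubt this works for millions of documents.

```python
import json


def random_sample_query(sample_size, seed=42):
    """Build a search body that scores every document randomly
    and keeps only the top `sample_size` hits.

    `seed` and the `_seq_no` field are illustrative assumptions;
    a fixed seed makes the random ordering repeatable.
    """
    return {
        "size": sample_size,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "random_score": {"seed": seed, "field": "_seq_no"},
                # Use the random value as the score instead of
                # combining it with the match_all score.
                "boost_mode": "replace",
            }
        },
    }


# Body that could be sent to the _search endpoint of the source index.
body = random_sample_query(1000)
print(json.dumps(body, indent=2))
```

This is fine for a few thousand documents, but every document still has to be scored and sorted for each request, so I am looking for something cheaper for large samples.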