For machine learning validation, I would like to split the data in Elasticsearch into two sets.
Using random_score in combination with the size parameter would give me one set, but how can I obtain all the other documents? Is there an easy way to do that?
If you index the data into multiple shards, it is already distributed effectively at random across those shards, and you can use search routing to query only one of them.
The routing function is a common hash-modulo-N function, where N is the number of shards.
You could apply the same logic yourself by hashing the ID of each document in a script query and keeping only the documents whose hash falls into a given bucket.
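The code from the original reply is not shown here, so the following is only a minimal sketch of the idea, not the exact query that was posted. It builds two complementary search bodies in Python, one per bucket, using a Painless script filter. The function names (`java_string_hash`, `bucket_of`, `split_query`) and the parameter names `n` and `bucket` are made up for illustration, and the script assumes that `doc['_id']` is readable from scripts in your Elasticsearch version, which may not hold everywhere. The local hash helper only mirrors Java's `String.hashCode()` so the split can be checked without a cluster.

```python
def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() as a 32-bit signed int (illustrative helper)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_of(doc_id: str, n: int) -> int:
    """Local mirror of the script expression Math.abs(hash % n)."""
    return abs(java_string_hash(doc_id)) % n


def split_query(bucket: int, n: int = 2) -> dict:
    """Build a search body matching documents whose ID hash lands in `bucket`."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            # Assumption: _id is accessible via doc values in scripts.
                            "source": "Math.abs(doc['_id'].value.hashCode() % params.n) == params.bucket",
                            "params": {"n": n, "bucket": bucket},
                        }
                    }
                }
            }
        }
    }


# Two complementary sets: every document falls into exactly one bucket.
validation_query = split_query(0)
training_query = split_query(1)
```

Each body can be sent to the `_search` endpoint of the index as usual. Because the bucket depends only on the document ID, the split is repeatable across requests, and changing `n` gives other ratios (for example `n = 5` with one bucket held out for a roughly 80/20 split).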