Split data into two sets (test/train)


For machine learning validation, I would like to split the data in Elasticsearch into two sets.
The random score function in combination with the size parameter would give me one set, but how can I obtain all the other documents? Is there an easy way to do that?

If you use multiple shards to index the data, it's already randomly distributed across the shards, and you can use search routing to query only one of several shards.

Thanks for the quick reply. So far, it's only in one shard.

Even if there were multiple shards, that doesn't seem to be a very flexible solution with regard to getting a specified split ratio like 20% / 80%.

The routing function is a common hash modulo N function where N is the number of shards.
You could apply the same logic to the ID of each document with a script query, for example something like:

  "query": {
    "script": {
      "script": "doc['_id'].value.hashCode() % 2 == 0"
    }
  }

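For a ratio such as 20% / 80%, the same hash-modulo idea can be extended by hashing into 100 buckets and comparing against a threshold. The following is only a client-side sketch in Python to illustrate the logic, not something taken from Elasticsearch itself: the helper names `bucket_for` and `is_test`, the sample IDs, and the use of MD5 as a stable hash are all assumptions made for the example.

```python
import hashlib


def bucket_for(doc_id: str, buckets: int = 100) -> int:
    """Map a document ID to a stable bucket in [0, buckets)."""
    # MD5 is used here only as a stable, well-mixed hash (an assumption
    # for this sketch); it is not the hash Elasticsearch uses for routing.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_test(doc_id: str, test_percent: int = 20) -> bool:
    """IDs whose bucket falls below test_percent go to the test set."""
    return bucket_for(doc_id) < test_percent


# Made-up IDs, for illustration only.
ids = [f"doc-{i}" for i in range(10_000)]
test_ids = [d for d in ids if is_test(d)]
train_ids = [d for d in ids if not is_test(d)]

print(len(test_ids), len(train_ids))  # roughly a 20/80 split
```

Because the assignment depends only on the ID, every document always lands in the same set, and the two sets are complementary by construction. If the same threshold comparison is moved into a script query, keep in mind that `hashCode()` can be negative and that `%` in Java-like languages keeps the sign of the left operand, so the value would need to be made non-negative before comparing it against the threshold.
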