Split data into two sets (test/train)

eliase · May 18, 2021, 6:52am

Hi,

For machine learning validation, I would like to split the data in Elasticsearch into two sets.
The random score in combination with the size parameter would give me one set but how can I obtain all other documents? Is there an easy way to do that?

Mark_Harwood · May 18, 2021, 7:16am

If you use multiple shards to index the data, it’s already randomly distributed across shards, and you can use search routing to query only one of several shards.

eliase · May 18, 2021, 7:26am

Thanks for the quick reply. So far, it's only in one shard.

Even if there were multiple shards, that doesn't seem to be a very flexible solution with regard to getting a specified split ratio like 20% / 80%.

Mark_Harwood · May 18, 2021, 9:01am

The routing function is a common hash modulo N function where N is the number of shards.
You could apply the same logic to querying the ID of documents with a script query e.g. something like:

{
  "query": {
    "script": {
      "script": "doc['_id'].value.hashCode() % 2 == 0"
    }
  }
}
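
To illustrate how the same hash-modulo idea could give an arbitrary ratio such as 20% / 80%, here is a minimal client-side sketch in Python. It is not the script query above: the function name `assign_split`, the `test_percent` parameter, the use of CRC32 as the hash, and the made-up `doc-N` IDs are all assumptions for illustration only.

```python
import zlib


def assign_split(doc_id: str, test_percent: int = 20) -> str:
    """Deterministically assign a document ID to "test" or "train".

    Hypothetical helper: hashes the ID, takes the result modulo 100,
    and puts IDs whose bucket is below test_percent into the test set.
    """
    # zlib.crc32 returns a non-negative integer in Python 3,
    # so the bucket is always in the range 0..99.
    bucket = zlib.crc32(doc_id.encode("utf-8")) % 100
    return "test" if bucket < test_percent else "train"


# Made-up IDs standing in for real document IDs.
ids = [f"doc-{i}" for i in range(10_000)]
test_ids = [i for i in ids if assign_split(i) == "test"]
train_ids = [i for i in ids if assign_split(i) == "train"]
print(len(test_ids), len(train_ids))
```

Because the assignment depends only on the ID, every document lands in exactly one set and the split is stable across runs. In the script query itself, a similar effect would presumably come from changing the modulus, for example `% 5 == 0` for roughly 20% of documents and `% 5 != 0` for the rest, assuming the hash values are spread evenly.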
