In order to process a large number of documents from an index, I have several instances of an app running in parallel, each processing a subset of the documents. Each subset is defined by a particular date range. The number of documents created per day varies over the year, so I would like to determine the date ranges that would lead to (say) 4 buckets containing approximately the same number of documents each.
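To make the setup concrete, here is a minimal sketch of the query one instance runs for its bucket. The index name `my-index`, the date field `created_at`, the dates, and the 8.x Python client are all assumptions for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# One app instance only touches the documents whose creation date
# falls inside its assigned bucket, e.g. Jan-Mar of a given year.
resp = es.search(
    index="my-index",
    query={
        "range": {
            "created_at": {"gte": "2023-01-01", "lt": "2023-04-01"}
        }
    },
    size=100,
)
print(resp["hits"]["total"])
```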
For an even split I’d be tempted to do something based on a hash-modulo of the doc ID: a scripted query where hash(id) % 4 == 0 selects documents for the first client, hash(id) % 4 == 1 for the second, and so on. That’s basically how doc routing places documents evenly across shards.
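A sketch of that kind of scripted query, assuming the 8.x Python client and a keyword field (here called `doc_id`) that mirrors the document ID, since `_id` itself generally can't be read from a script; the hash and modulo are done in Painless:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

NUM_CLIENTS = 4

def partition_filter(client_no: int) -> dict:
    """Script filter keeping only documents whose hashed ID lands in this client's partition."""
    return {
        "script": {
            "script": {
                # floor-mod so negative hashCode values still map into 0..n-1
                "source": (
                    "int h = doc['doc_id'].value.hashCode();"
                    " return ((h % params.n) + params.n) % params.n == params.client;"
                ),
                "params": {"n": NUM_CLIENTS, "client": client_no},
            }
        }
    }

# Quick sanity check: each partition should hold roughly a quarter of the index.
for client_no in range(NUM_CLIENTS):
    count = es.count(index="my-index", query={"bool": {"filter": [partition_filter(client_no)]}})
    print(client_no, count["count"])
```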
One aspect of this that I didn't mention is that, to increase resilience to failures while extracting this data, each day within each bucket is processed as a unit of work, so that if a failure occurs, that day can be logged and reprocessed individually without redoing the whole bucket. The entire set of data to be extracted is defined by an overall date range, so splitting it further into date-range buckets and then into single-day tasks seemed natural.
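As an illustration of that decomposition (just a sketch; the bucket boundaries are made up), a bucket's date range can be expanded into one single-day task per day, and a failed day can be re-queued on its own:

```python
from datetime import date, timedelta

def daily_tasks(bucket_start: date, bucket_end: date):
    """Yield (gte, lt) day boundaries covering [bucket_start, bucket_end)."""
    day = bucket_start
    while day < bucket_end:
        yield day.isoformat(), (day + timedelta(days=1)).isoformat()
        day += timedelta(days=1)

# Each (gte, lt) pair becomes the range filter for one unit of work;
# if extracting a day fails, only that pair is logged and retried.
for gte, lt in daily_tasks(date(2023, 1, 1), date(2023, 4, 1)):
    pass  # run the extraction query for this single day
```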
The restart position is actually specified via the search_after parameter, which is the recommended way to paginate deep into a set of search results. The hash-modulo example query I gave can be used in conjunction with search_after and a date sort order. If you also use a PIT (point in time) ID (see the search_after docs), this provides resilience if a node holding one of the shards goes down.
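A minimal sketch of that paging pattern with the 8.x Python client, again assuming an index `my-index` and a date field `created_at`; the query here is a single-day range, but it could just as well be the hash-modulo filter above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Open a point in time so paging sees a consistent snapshot of the index.
pit = es.open_point_in_time(index="my-index", keep_alive="2m")
pit_id = pit["id"]

search_after = None  # or a previously saved sort value to restart from
while True:
    resp = es.search(
        size=1000,
        query={"range": {"created_at": {"gte": "2023-01-01", "lt": "2023-01-02"}}},
        sort=[{"created_at": "asc"}],  # date sort; the PIT adds an implicit _shard_doc tiebreaker
        search_after=search_after,
        pit={"id": pit_id, "keep_alive": "2m"},
    )
    hits = resp["hits"]["hits"]
    if not hits:
        break
    # ... process this page of hits ...
    search_after = hits[-1]["sort"]  # restart position for the next page
    pit_id = resp["pit_id"]          # the PIT id can change between calls

es.close_point_in_time(id=pit_id)
```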