In order to process a large number of documents from an index, I have several instances of an app running in parallel, each processing a subset of the documents. Each subset is defined by a particular date range. The number of documents created per day varies over the year, so I would like to determine the date ranges that would lead to (say) 4 buckets containing approximately the same number of documents each.
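To make the setup concrete, here is a minimal sketch of the query one instance runs for its bucket. The index name `my-index`, the date field `created_at`, the dates, and the 8.x Python client are all assumptions for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# One app instance only touches the documents whose creation date
# falls inside its assigned bucket, e.g. Jan-Mar of a given year.
resp = es.search(
    index="my-index",
    query={
        "range": {
            "created_at": {"gte": "2023-01-01", "lt": "2023-04-01"}
        }
    },
    size=100,
)
print(resp["hits"]["total"])
```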
For an even split I’d be tempted to do something based on a hash-modulo of the doc ID: a scripted query where hash(id) % 4 == 0 selects documents for the first client, hash(id) % 4 == 1 for the second, and so on. That’s basically how doc routing places documents evenly across shards.
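A sketch of that kind of scripted query, assuming the 8.x Python client and a keyword field (here called `doc_id`) that mirrors the document ID, since `_id` itself generally can't be read from a script; the hash and modulo are done in Painless:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

NUM_CLIENTS = 4

def partition_filter(client_no: int) -> dict:
    """Script filter keeping only documents whose hashed ID lands in this client's partition."""
    return {
        "script": {
            "script": {
                # floor-mod so negative hashCode values still map into 0..n-1
                "source": (
                    "int h = doc['doc_id'].value.hashCode();"
                    " return ((h % params.n) + params.n) % params.n == params.client;"
                ),
                "params": {"n": NUM_CLIENTS, "client": client_no},
            }
        }
    }

# Quick sanity check: each partition should hold roughly a quarter of the index.
for client_no in range(NUM_CLIENTS):
    count = es.count(index="my-index", query={"bool": {"filter": [partition_filter(client_no)]}})
    print(client_no, count["count"])
```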
One aspect of this that I didn't mention is that, to increase resilience to failures while extracting this data, each day within each bucket is processed as a unit of work, so that if a failure occurs, that day can be logged and reprocessed individually without redoing the whole bucket. The entire set of data to be extracted is defined by an overall date range, so splitting it further into date-range buckets and then into single-day tasks seemed natural.
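As an illustration of that decomposition (just a sketch; the bucket boundaries are made up), a bucket's date range can be expanded into one single-day task per day, and a failed day can be re-queued on its own:

```python
from datetime import date, timedelta

def daily_tasks(bucket_start: date, bucket_end: date):
    """Yield (gte, lt) day boundaries covering [bucket_start, bucket_end)."""
    day = bucket_start
    while day < bucket_end:
        yield day.isoformat(), (day + timedelta(days=1)).isoformat()
        day += timedelta(days=1)

# Each (gte, lt) pair becomes the range filter for one unit of work;
# if extracting a day fails, only that pair is logged and retried.
for gte, lt in daily_tasks(date(2023, 1, 1), date(2023, 4, 1)):
    pass  # run the extraction query for this single day
```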
The restart position is actually specified via the search_after parameter, which is the recommended way to paginate deep into a set of search results. The hash-modulo example query I gave can be used in conjunction with search_after and a date sort order. If you also use a PIT (point in time) ID (see the search_after docs), this provides resilience if a node holding one of the shards goes down.
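A minimal sketch of that paging pattern with the 8.x Python client, again assuming an index `my-index` and a date field `created_at`; the query here is a single-day range, but it could just as well be the hash-modulo filter above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Open a point in time so paging sees a consistent snapshot of the index.
pit = es.open_point_in_time(index="my-index", keep_alive="2m")
pit_id = pit["id"]

search_after = None  # or a previously saved sort value to restart from
while True:
    resp = es.search(
        size=1000,
        query={"range": {"created_at": {"gte": "2023-01-01", "lt": "2023-01-02"}}},
        sort=[{"created_at": "asc"}],  # date sort; the PIT adds an implicit _shard_doc tiebreaker
        search_after=search_after,
        pit={"id": pit_id, "keep_alive": "2m"},
    )
    hits = resp["hits"]["hits"]
    if not hits:
        break
    # ... process this page of hits ...
    search_after = hits[-1]["sort"]  # restart position for the next page
    pit_id = resp["pit_id"]          # the PIT id can change between calls

es.close_point_in_time(id=pit_id)
```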