We're learning the hard way that daily indices for a large variety of related log types are a surefire way to swamp your cluster in shard state. Consequently, we've decided to try out the _reindex API to amalgamate all the indices of a given day into one, greatly reducing shard state in the process. Initial testing of the _reindex API seemed promising, with several minor test indices packing neatly into a single target index.
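For reference, the requests we've been running are of roughly this shape (the index names here are placeholders, not our real ones):

```
POST _reindex
{
  "source": {
    "index": ["logs-app1-2018.03.20", "logs-app2-2018.03.20", "logs-app3-2018.03.20"]
  },
  "dest": {
    "index": "logs-combined-2018.03.20"
  }
}
```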
So far, so good.
Reviewing our first runs of reindexing major production indices, we are missing documents. A lot: on the order of 20-40% of the combined document count of the source indices is missing from the finished target index.
We are not seeing errors in the logs. The data nodes are certainly busy, but CPU stats indicate merely that they are breaking a little sweat for a change, and heap usage is unremarkable. In short, the cluster seems to be in rather good shape.
We are running three data nodes with plenty of CPU power and memory, and SSDs in RAID0.
We ran an experiment in Console in which we first created the target index manually rather than letting it be auto-created, set refresh_interval to -1 and the number of replicas to 0, and then added the source indices one at a time, in ascending order of size. Finally, we reset the settings to a 60s refresh interval and 2 replicas and watched the document count rise in the target index.
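Concretely, the Console steps looked roughly like this (index names and values are illustrative placeholders):

```
# create the target index up front instead of letting _reindex auto-create it
PUT logs-combined-2018.03.20
{
  "settings": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

# add the source indices one at a time, smallest first
POST _reindex
{
  "source": { "index": "logs-small-source-1" },
  "dest": { "index": "logs-combined-2018.03.20" }
}

# once all sources are in, restore refresh and replicas
PUT logs-combined-2018.03.20/_settings
{
  "refresh_interval": "60s",
  "number_of_replicas": 2
}

# compare against the sum of the source index counts
GET logs-combined-2018.03.20/_count
```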
The first few indices, each with fewer than 1 million documents, packed together nicely, and the sum of the source index counts matched the target index count exactly. But after adding a source index of 13M documents, documents went missing.
We are rather at a loss here. What tools or APIs will best allow us to diagnose where the documents are disappearing? What cluster or index settings might be of use?
We keep seeing the error message detailed in https://github.com/elastic/elasticsearch/issues/26153. I'll try increasing thread_pool.search.queue_size to 5000 in an attempt to circumvent the failure. In any case, it would seem that the _reindex API isn't functioning as intended.
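For clarity, our understanding is that this is a static node setting, so the plan is to set it in elasticsearch.yml on each data node and do a rolling restart:

```
# elasticsearch.yml on each data node (static setting, needs a restart)
thread_pool.search.queue_size: 5000
```

The per-node search queue and rejection counters should then be visible via GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected.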
We have also noted that setting "version_type": "external" alleviated the issue somewhat, as did setting the number of slices equal to the number of shards in the target index. But that only postpones the failures, and in any case the problem appears to be a logical one rather than a question of raw capacity; we would expect the reindex operation to be carried out to assured completeness, retrying any operations that time out or are rejected.
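For completeness, the variant that held out the longest looked something like this (placeholder names again; slices here is simply the shard count of the target index):

```
POST _reindex?slices=5&wait_for_completion=false
{
  "source": {
    "index": "logs-big-source"
  },
  "dest": {
    "index": "logs-combined-2018.03.20",
    "version_type": "external"
  }
}
```

wait_for_completion=false is just so the operation can be followed via the Tasks API, whose status output includes the created, version_conflicts and retries counters.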
Referencing the GitHub issue at https://github.com/elastic/elasticsearch/issues/26153, the last update as of 2018-03-20 recommends using the search_after mechanism to circumvent this problem. How exactly would this be done when using the Reindex API?
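For what it's worth, our understanding of search_after on the plain Search API is roughly the following, where event_id stands in for some unique tiebreaker field in our documents (field names and values are purely illustrative); what we can't see is how to express this through the Reindex API:

```
# first page, sorted on a timestamp plus a unique tiebreaker field
GET logs-big-source/_search
{
  "size": 1000,
  "sort": [
    { "@timestamp": "asc" },
    { "event_id": "asc" }
  ]
}

# next page, passing the sort values of the last hit of the previous page
GET logs-big-source/_search
{
  "size": 1000,
  "search_after": [1521504000000, "some-last-event-id"],
  "sort": [
    { "@timestamp": "asc" },
    { "event_id": "asc" }
  ]
}
```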