My understanding is that the cluster state should transition quickly from yellow to green thanks to the synced flush. However, shard allocation is being throttled and is therefore very slow.
Did I miss something?
Our cluster currently contains too many shards and we are working on reducing that number. Will that solve our problem, or are there other factors influencing node restart duration?
Did the response to the synced flush indicate that it was completely successful? Did you stop indexing while the node was offline? If the answer to either question is no then it's possible that the synced flush marker isn't there on every shard (either it wasn't put in place, or it was put there and then removed) and this results in a slower recovery.
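As a rough way to answer the first question, the synced flush response can be inspected for failures. The sketch below assumes the documented response shape (a top-level `_shards` object plus one entry per index, each with `total`, `successful` and `failed` counts). The function name `summarize_synced_flush` and the sample index names are made up for illustration.

```python
def summarize_synced_flush(response):
    """Return (fully_successful, failed_count, indices_with_failures).

    `response` is the parsed JSON body of a synced flush call.
    """
    totals = response["_shards"]
    # Every key other than "_shards" is assumed to be an index name.
    failed_indices = sorted(
        name
        for name, stats in response.items()
        if name != "_shards" and stats.get("failed", 0) > 0
    )
    return totals["failed"] == 0, totals["failed"], failed_indices


# Hypothetical response, not taken from the cluster discussed here.
example = {
    "_shards": {"total": 4, "successful": 3, "failed": 1},
    "logs-1": {"total": 2, "successful": 2, "failed": 0},
    "logs-2": {"total": 2, "successful": 1, "failed": 1},
}

ok, failed, indices = summarize_synced_flush(example)
print(ok, failed, indices)
```

Any index listed in the third element has at least one shard copy without a synced flush marker, so it is a candidate for the slower recovery path.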
The synced flush response indicates that almost all shards were successful: only 22 out of 22472 failed (we really do have too many shards). Indexing wasn't stopped during the node restart, but only a small number of shards should have been touched (I estimate the maximum to be 642).
With 6 data nodes, 3745 (22472 / 6) shards are unassigned after a node restart. I expect at most 107 (642 / 6) of them to recover slowly and the remaining shards to recover very quickly (as their flush markers shouldn't have changed).
For a shard that has been touched during the node restart (so that its flush marker has changed), is its recovery duration a function of its size?
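The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in this post: it assumes shards are spread evenly across the 6 data nodes and that the 642 touched shards are spread evenly too, which may not hold in practice. The variable names are illustrative.

```python
total_shards = 22472
data_nodes = 6
touched_shards = 642  # estimated upper bound on shards written to during the restart

# Shards that become unassigned when one node leaves (even distribution assumed).
unassigned = total_shards // data_nodes

# Of those, the ones whose flush marker may have changed, so they recover slowly.
slow = touched_shards // data_nodes

# The rest should still carry a matching flush marker and recover quickly.
fast = unassigned - slow

print(unassigned, slow, fast)
```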
It depends. In some recoveries Elasticsearch has to make a brand-new copy of the shard. It will re-use any segments that it can, but often there aren't many of these. This was the case for all recoveries in versions before 6.0, and is still the case in more recent versions if there have been too many changes (>512MB of translog), or the node has been offline for too long (>12h), or the new copy is assigned to a different node from the one that holds the previous, stale copy of the shard.
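The conditions listed above can be restated as a small predicate. This is a sketch that only encodes the rules as described in this reply for a shard whose flush marker no longer matches. It is not Elasticsearch's actual recovery logic, and `RecoveryContext`, `needs_full_copy` and the field names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as quoted in the reply above.
TRANSLOG_LIMIT_BYTES = 512 * 1024 * 1024
OFFLINE_LIMIT_HOURS = 12


@dataclass
class RecoveryContext:
    es_major_version: int           # e.g. 5, 6, 7
    translog_bytes_missed: int      # changes made while the copy was offline
    hours_offline: float            # how long the node was away
    same_node_as_stale_copy: bool   # is the new copy placed where the old one lives?


def needs_full_copy(ctx):
    """True if, per the rules above, the shard has to be copied afresh."""
    return (
        ctx.es_major_version < 6
        or ctx.translog_bytes_missed > TRANSLOG_LIMIT_BYTES
        or ctx.hours_offline > OFFLINE_LIMIT_HOURS
        or not ctx.same_node_as_stale_copy
    )


# A quick restart with few changes, copy reassigned to the same node.
quick_restart = RecoveryContext(7, 100 * 1024 * 1024, 0.5, True)
print(needs_full_copy(quick_restart))
```

When the predicate is true, recovery time depends largely on how much of the shard has to be copied, which is why shard size matters in those cases.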
Is that different from what you're seeing? Are you seeing shards recover that you weren't expecting to need recovery?