Thank you for clearing up that misunderstanding. Then your cluster setup should work fine without any unassigned replicas lingering.
As for the original problem, I don't know the internal workings of Elasticsearch well enough to explain why the replicas seem to recover from scratch in your cluster. The only thing I can suggest is that the slow recovery is somehow related to the high number of shards in the cluster.
In case you're uncertain about the important differences between a primary and a replica shard (they hold the same data and may thus look identical), I'll add a few words.
It's important to keep in mind that a replica is not a crucial element of an index, whereas each primary shard is vital for the index to operate. An index will stay in a red state until all of its primary shards have been recovered; you can't index into it, and you won't get sensible search results back from it. Hence, from an operational perspective it's much more important to recover the primary shards than the replicas. If there is competition for system resources, for example because of I/O throttling, it makes sense for Elasticsearch to focus its limited resources on the primary shards and, only when they have all been recovered and the index goes from red to yellow, start recovering the replicas.
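If you want to watch that sequence in your own cluster, the standard monitoring endpoints show it directly. This is just plain console (Kibana Dev Tools) syntax, with no assumptions about your setup beyond a running cluster:

```
# Overall state: "red" while any primary is unassigned, "yellow" once all
# primaries are recovered but replicas are still missing, "green" when both are done.
GET _cluster/health

# Per-shard view: primaries (prirep = p) should come back before replicas (r).
GET _cat/shards?v&h=index,shard,prirep,state,node

# Currently running recoveries, with stage and progress per shard.
GET _cat/recovery?v&active_only=true
```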
In order to improve your cluster performance I still believe the best course of action is to look at the number of shards, both primary and replica, and try to lower those numbers. One of the recommendations in the blog post mentioned earlier is to aim for shard sizes of 20-40 GB, but I have also heard 30-50 GB from Elastic technicians. So for my own indices I try to keep the shards in that range (brand new shards will naturally start out small, but should grow to 20-50 GB). For instance, if I estimate that a monthly log index will hold about 100 GB of data, I assign 3 primary shards to it. This way I reduce the overall shard count, so that if each node holds, say, 600 GB of data, it only carries 20-30 shards.
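To make that sizing concrete, this is roughly how such a monthly index could be created; the index name and the exact numbers are only examples taken from my own estimate above, not something your data requires:

```
# Roughly 100 GB of monthly data -> 3 primaries of ~33 GB each,
# plus one replica of each primary.
PUT logs-2018.06
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}
```

For time-based indices it's usually nicer to put these settings in an index template, so every new monthly index picks them up automatically.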
When it comes to the number of replica shards in an index, I once tried setting it to N-1 in a cluster of N data nodes, naively thinking it would be best to have the complete data set on each node, both for scaling and for redundancy. But that is wrong for several reasons:
- It doesn't scale at all. Since every node has a full copy of the index data, once one node is getting full (at 85% disc usage, no new shards can be assigned to the node) all of them are. The only way to scale the cluster for more data is then to increase disc capacity on each node by mounting more discs and updating the path.data setting in elasticsearch.yml, which is both cumbersome and very error prone. A much better way to scale a cluster for more data is to add new nodes, allowing shards to be moved from the nearly full nodes to the empty ones, and this only works if the cluster isn't designed to have a full copy of the index data on each node (see the _cat sketch after this list for how to check node disc usage).
- It makes low level caching impossible. With a full copy of the index data on each node you will probably end up with many hundreds of gigabytes or even terabytes of data per node, and there is no way to cache more than a tiny fraction of that in the low level caches, which means Elasticsearch will have to read just about every piece of data from disc rather than use the cache to answer queries. If, on the other hand, the data is spread out across many nodes, you may have only 200-300 GB of data per node, which means a larger fraction of it can be cached between requests, making Elasticsearch respond much faster.
- It gives a false sense of redundancy. One reason for having N-1 replicas is the flawed idea that a full copy of all indices on every node will allow you to keep the cluster running even if most nodes die. But this is not so, because if all but one node dies it's probably just a question of time before the last one does too, especially since it will have to handle all the index and search requests alone, which will quickly overwhelm it. A better policy is to monitor the cluster closely and make sure to fix the problem if one node dies, before more nodes go down.
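To check how your own shards and disc usage are actually distributed before deciding anything, the _cat APIs give a quick overview (again plain console syntax, nothing cluster-specific assumed):

```
# Shard count, disc used and disc percentage per data node -- useful for
# spotting nodes that are approaching the 85% low watermark.
GET _cat/allocation?v

# Size and shard counts per index, sorted by store size, to see which
# indices dominate the shard count.
GET _cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc
```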
To summarize, I would recommend (a rough sketch of both steps follows the list):
- reducing the number of replica shards for all indices to just 1.
- using the Shrink Index or Reindex API to reduce the number of primary shards per index and get shard sizes into the 20-50 GB range.
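As a rough sketch of what those two steps look like in practice (the index names, the node name and the target shard count below are placeholders, so adapt them to your own cluster):

```
# 1. Drop the replica count to 1. The wildcard is just an example; you may
#    prefer to do this index by index.
PUT logstash-*/_settings
{
  "index.number_of_replicas": 1
}

# 2. Shrink an over-sharded index to fewer primaries. The source index must
#    first be made read-only and have a copy of every shard on a single node.
PUT logstash-2018.06.01/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "node-1"
}

# The target shard count must be a factor of the source shard count;
# if it isn't, use the Reindex API instead.
POST logstash-2018.06.01/_shrink/logstash-2018.06.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}
```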
But whether this solves your slow replica recovery I don't know.
Good luck!