Understanding recovery: primary vs replica performance difference

I have a fairly simple cluster running 6.2.4 with 3 master nodes, 3 data nodes, 70 indices * 20 shards * 3 replicas. The cluster is near empty at the moment with about 3GB of data total. In short, every shard is replicated on all nodes.

I realize this is severely oversharded, but I am trying to understand how the different shards affect the recovery process.

When I restart the entire cluster, it recovers from red to yellow fairly fast in about 1 minute. The load on the system seems to be very low during this time. However, recovering from yellow to green, i.e. reassigning all the replicas seems to take about 10 minutes. During this time the load on all nodes spikes with a high amount of disk access.

I experimented with setting the number_of_replicas to 0 and back to 2 after that. The time it takes to create all the replicas from scratch is about the same 10 minutes it took for the cluster to recover after a startup.

Why does this happen? Does it actually recreate the replicas from the primary instead of recovering what is already there? Is there a reason it treats primaries different than replicas? Could this be a bug?

This blog post should give you a good overview of how Elasticsearch handles shards, you will also find some good design rules on how to do the sharding. For instance, the recommendation that you have no more than 20-25 shards per GB of heap space: If you run Elasticsearch with the maximum recommended Java Heap Size, 30 GB, you should have no more than 25*30 = 750 shards on each data node. If you have only 16 GB of heap space, the number of shards should be no more than 25 * 16 = 400.

I understand this to mean that you've got 3 dedicated master nodes, where no data is stored (node.data: false), and that all shards are stored on the 3 data nodes. If so, your data nodes currently have 70 * 20 primary and 70 * 20 * 3 replica shards or a total of 5600 shards. If you divide that evenly between the 3 nodes they get about 1866 shards each. And that is clearly too many shards, at least twice as many as recommended in the blog post.

A problem with oversharding is that the cluster state explodes, because the cluster needs to keep tab on each of the shards to know which are primary shards and whether their replicas are up to date. This is important to know in case a data node goes down and its primary shards become unavailable. Then the cluster must promote replica shards on the other nodes to new primaries.

Another issue. Since you only have 3 data nodes there is no point in having 3 replicas because a replica can never be located on the same node as the primary. With a 3 data node cluster you can only have 2 working replicas for each primary shard, the last replica will remain unassigned so you should set the number_of_replicas for each index to 2.

Thanks for the reply @Bernt_Rostad and your recommendation on shard sizes. I wrote this in my original post, though:

I realize this is severely oversharded, but I am trying to understand how the different shards affect the recovery process.

I was wondering why primaries recover nearly instantly, but replica's seem to be regenerated from scratch instead of being recovered. That seems to me either a bug or a very inefficient way since the data should already be there on the node. Having an arbitrary high shard count just made this issue more apparent.

About the replicas: I must've used the wrong terminology. I have a replication factor of 3 by setting number_of_replicas to 2. So on a 3 nodes cluster every index has 1 primary and 2 replicas = 3 copies of each shard.

Thank you for clearing up that misunderstanding. Then your cluster setup should work fine without any unassigned replicas lingering.

As for the original problem, I don't know the internal workings of Elasticsearch that well so I'm afraid I can't explain why the replicas seem to recover from scratch in your cluster. The only thing I can suggest is that the slow recovery is somehow related to the high number of shards in the cluster.

In case you're uncertain about the important differences between a primary and a replica shard, which holds the same data and thus may look the same, I'll add a few words.

It's important to keep in mind that a replica is not a crucial element of an index whereas each primary shard is vital for the index operation. An index will be in a red state until all of its primary shards have been recovered, you can't index into it and not get a sensible search result back from such an index. Hence, from an operational perspective it's much more important to recover the primary shards than the replicas. If there is competition for system resources, perhaps a throttling of I/O, it makes sense for Elasticsearch to focus its limited resources on the primary shards and only when they have all been recovered, and the index goes from red to a yellow state, start recovering the replicas.

In order to improve your cluster performance I still believe the best line of action would be to look at the number of shards, both primary and replica, and try to lower those numbers. One of the recommendations in the earlier mentioned blog post is to aim for shard sizes of 20-40 GB, but I have also heard 30-50 GB from Elastic technicians. So for my indices I try to keep the shards in that range (brand new shards will start out small, naturally, but should grow to 20-50 GB). For instance, if I estimate a monthly log index to take about 100 GB of data, I assign 3 primary shards to it. This way I reduce the overall shard count, so that if each node has say 600 GB of data there will be just 20-30 shards on each.

When it comes to the number of replica shards in an index I once tried setting it to N-1 in a cluster of N data nodes, naively thinking it would be best to have the complete data set on each node both for scaling and for redundancy. But that is wrong for several reasons.

  1. It doesn't scale at all. Since every node has a full copy of the index data it means once one node is getting full (at 85% disc usage, no new shards can be assigned to the node) all of them are. The only way to scale the cluster for more data is to increase disc capacity on each node by mounting more discs and update the path.data setting in Elasticsearch.yml. This is both cumbersome and very error prone. A much better way to scale a cluster for more data is to add new nodes, allowing shards to be moved from the nearly full nodes to the empty ones, and this only works if the cluster isn't designed to have a full copy of index data on each node.

  2. It makes low level caching impossible. By having a full copy of the index data on each node you will probably end up with many hundred gigabytes or even terabytes of data and there is no way to cache more than a tiny fraction of that in the low level caches, which means Elasticsearch will have to read just about every piece of data from disc rather than use the cache to return query results. If, on the other hand, data is spread out across many nodes you may have only 2-300 GB of data per node which means a larger fraction of the data can be cached between requests, making Elasticsearch respond much faster.

  3. It gives a false sense of redundancy. One reason for having N-1 replicas is based on the flawed idea that having a full copy of all indices on every node will allow you to keep the cluster running even if most nodes die. But this is not so because if N-2 nodes die it's probably just a question of time before the last one does too, especially since it will have to handle all the index and search requests all alone, which will quickly overwhelm it. A better policy is to monitor the cluster closely and make sure to fix the problem if one node dies, before more nodes go down.

To summarize, I would recommend:

  • reducing the number of replica shards for all indices to just 1.
  • using Shrink Index or Reindex API to reduce the number of primary shards per index to obtain shard sizes in the 20-50 GB range.

But whether this solves your slow replica recovery I don't know.

Good luck!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.