What happens when I increase replica count?

Hello,

We have an 18-node Elasticsearch cluster. We have a few indices, but let's talk about one specific index. It has 60 shards with a replica count of 1. There are about 11 million documents and the index size is about 850GB.

We are getting into a process of re-indexing regularly. As part of re-indexing, among other settings, I have set the new index's replica count to 0. Let's say we complete the re-indexing operation in a few hours; once re-indexing completes, we will set the replica count to 1.

Here are the questions I have:

  1. What happens after I increase the replica count of a newly created index from 0 -> 1?
  2. How long does it take ES to create the new replica shards for all these 60 shards for the new index?
  3. Does ES cluster health turn yellow when I increase the replica count? Alternatively, how does ES communicate to us that it's replicating the shards?
  4. What should I monitor to know when all the 60 shards have replicated?
  5. Should I wait to switch the alias from the older index to the newer index until the replication process has completed? In other words, can the new index start taking live traffic (both indexing and search traffic) while the replicas are still being set up?

From what I have read, indexing is done on the primaries while searching can be done on the replicas. So I guess it's safe to continue to index, but I'm not sure about searching.

Please advise. I searched for answers but couldn't find anything relevant.

We are on v6.3 of Elasticsearch.

Thanks!

Elasticsearch creates the new replica by making a file-by-file copy of the primary, then replays some indexing operations to bring the replica up to date. The details depend a lot on the version, but in recent versions (7.5+) once you stop indexing there should be no operations to replay. 6.3 has to replay more operations so that might take a while.
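For reference, the change that triggers this copy is a single index settings update. Below is a minimal sketch of building that request with the Python standard library; the base URL `http://localhost:9200` and the index name `new-index` are placeholders, not values from this thread.

```python
import json
import urllib.request


def replica_settings_request(base_url, index, replicas):
    """Build a PUT /<index>/_settings request that sets number_of_replicas.

    base_url and index are supplied by the caller; nothing is sent until
    the returned request is passed to urllib.request.urlopen().
    """
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical usage (placeholder host and index name):
# req = replica_settings_request("http://localhost:9200", "new-index", 1)
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Once the update is acknowledged, the replica shards are allocated and the file copy described above begins.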

How long that takes depends entirely on how fast Elasticsearch can copy the data over.

Yes, the health will be yellow until the new shards are started.

To know when all the shards have replicated, monitor cluster health, or the indices recovery API if you want to see more detailed progress.
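As a rough illustration of what that monitoring could look like, here is a sketch that interprets already-parsed JSON from `GET /_cluster/health/<index>` and `GET /<index>/_recovery`. The helper names are made up for this example, and the field names reflect the response shape as I understand it for 6.x, so verify them against real output from your cluster. Note that the recovery API only lists shards whose recovery has started, so replicas that are still unassigned will not show up in the summary.

```python
def is_fully_replicated(health):
    """True when a cluster health response reports green with nothing pending."""
    return (
        health.get("status") == "green"
        and health.get("initializing_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


def summarize_replica_recovery(recovery, index):
    """Summarize replica progress from an indices recovery response."""
    shards = recovery.get(index, {}).get("shards", [])
    # Only replica copies matter here; primaries are already started.
    replicas = [s for s in shards if not s.get("primary", False)]
    done = sum(1 for s in replicas if s.get("stage") == "DONE")
    total = sum(s["index"]["size"]["total_in_bytes"] for s in replicas)
    recovered = sum(s["index"]["size"]["recovered_in_bytes"] for s in replicas)
    percent = 100.0 * recovered / total if total else 0.0
    return {
        "replicas_seen": len(replicas),
        "replicas_done": done,
        "bytes_percent": round(percent, 1),
    }
```

Polling these two calls every minute or so should be enough to tell when the index has gone from yellow back to green.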

The new index can indeed start taking live traffic, both indexing and searching, as soon as it is yellow. All the search traffic will be sent to started shards only (initially just the primaries), which might be too much load for them.
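If you do switch the alias, removing it from the old index and adding it to the new one in a single `POST /_aliases` call makes the swap atomic. A minimal sketch of the request body follows; the alias and index names are placeholders, not names from this thread.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves an alias in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
body = alias_swap_actions("products", "old-index", "new-index")
print(json.dumps(body, indent=2))
```

Whether to send this while the index is still yellow or wait for green is the trade-off described above: the primaries alone have to carry all the search traffic until the replicas start.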

That's fantastic. Thanks for the quick reply.

We will try the other alternative as well - with a replica count of 1 to start with - and see which approach fares better. I guess it's okay to have the index in a yellow state, especially if it's quick enough (overnight).

Hello @DavidTurner,

We gave it a try by setting number_of_replicas to 0, among other things. The indexing process completed successfully. Then we updated the number_of_replicas setting to 1. The moment we updated the replica count to 1, we noticed the cluster health turn yellow. In ES-HQ, we noticed the newly created index in a yellow state as well. Querying the cluster settings showed that shards were being created, etc.

While all that was fun and what we were expecting to see, on the bad side of things, here is what happened:

  • CPU utilization on our data nodes spiked to a constant 80-90% (typical usage is under 20%)
  • We have a DataDog integration and it started showing a few nodes in a red state (we assumed it wasn't able to receive any metrics from those nodes)
  • Our search queries on the older (production) index took a hit and search latency spiked significantly

We waited in that state for about 45 minutes and couldn't take the risk any longer, so we ended up deleting the newly created index. Almost instantly, things went back to normal.

Based on the observations, here are more questions:

  1. Is it really a recommended practice to start with zero replicas and then update them later?
  2. If so, how do we deal with increased CPU utilization, increased search latency and nodes being reported as unhealthy, given that the newly built index and the older production index are on the same set of data nodes?
  3. What if we build the new index with the initial replica count set to 1? We are going to give this a try and see what works best for us.

Thanks!

You haven't said which version you're using, and the answer depends on that.

If you want to diagnose high CPU usage then the hot threads API is a good place to start. Since you suspect it's related to replica recovery, check the indices recovery API too. Can you share those outputs here?
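A small sketch for collecting both outputs while the problem is happening, using only the Python standard library. The base URL and index name are placeholders; `active_only=true` restricts the recovery output to recoveries still in progress.

```python
import urllib.request
from urllib.parse import quote


def diagnostic_urls(base_url, index):
    """Return the hot threads and indices recovery URLs for one index."""
    base = base_url.rstrip("/")
    return {
        # Hot threads output is plain text, not JSON.
        "hot_threads": f"{base}/_nodes/hot_threads",
        "recovery": f"{base}/{quote(index, safe='')}/_recovery?active_only=true",
    }


def fetch_text(url, timeout=30):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical usage (placeholder host and index name):
# for name, url in diagnostic_urls("http://localhost:9200", "new-index").items():
#     print(f"== {name} ==")
#     print(fetch_text(url))
```

Capturing the hot threads output a few times in a row, some seconds apart, tends to give a clearer picture than a single sample.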
