New indices remain yellow when there are relocations in the cluster

Hi,

When we reduce the number of nodes in our cluster by half, a lot of recoveries start as a result of excluding the old IPs. We are running a 120-node cluster indexing a total of 4M docs/second, with 60 primaries and 60 replicas in the latest index. If a rollover happens during recovery, the new index starts yellow and remains yellow for a long time, until it is rolled over in turn; the next rolled-over index then comes up yellow again until the recoveries complete. I also noticed that the shards remain unassigned for 5 minutes, and the allocation explain API shows that the concurrent-recoveries limit on the node has been breached. I was wondering if we could have a different throttling limit for newly created indices: their primaries are empty, so the replicas can be started quickly with minimal network bandwidth, as opposed to the other shards that are being moved. The primaries themselves come up quickly because they rely on a different throttling limit, node_initial_primaries_recoveries.
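For context, these are the two throttles in play. A minimal sketch of how they could be tuned via the cluster settings API (the values here are illustrative, not recommendations):

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8
  }
}
```

`node_concurrent_recoveries` caps peer (replica/relocation) recoveries per node, which is the limit the empty replicas are queuing behind; `node_initial_primaries_recoveries` only governs initial primary allocation, which is why the primaries come up fast.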
This is impacting the availability of our new indices, and any node going down during scale-down will turn the cluster red.
We even evaluated waiting for the index to become green before the alias switch, but it takes several minutes for the existing recoveries to finish and the replicas to be assigned, and the delay also depends on network bandwidth and the number of concurrent recoveries in flight. That makes client-side handling of the rollover difficult.
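For reference, the wait-for-green approach we evaluated amounts to a blocking health call like this (`my-new-index` is a placeholder for the rolled-over index):

```json
GET _cluster/health/my-new-index?wait_for_status=green&timeout=2m
```

The call returns early once the index is green, or after the timeout with `"timed_out": true`; the client still has to decide what to do when the timeout fires, which is exactly the handling that becomes awkward.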

How do you suggest solving this problem?

Do you adjust priority as your indexes age?

Here's what I think is happening, but mostly posting to follow the thread :slight_smile:

Your recoveries have you at the "limit" of concurrent recoveries. When a new index is allocated, it needs to "recover" the replica. That can't happen until at least the next recovery finishes. If the new index has a higher priority than the recovering indexes, it should be next; if it's the same, it's probably FIFO and the new index will wait its turn.
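If priority turns out to be the lever, it can be set per index (or in an index template so new indices get it automatically). A sketch, with an illustrative value and placeholder index name:

```json
PUT my-new-index/_settings
{
  "index.priority": 10
}
```

Higher `index.priority` values are allocated and recovered first; unprioritized indices fall back to comparing creation date, newer first.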

Our recoveries were taking longer than they should, and we weren't using much of the available bandwidth between nodes. Increasing our settings made the recoveries faster, so I stopped looking into the similar problem I was researching.

"indices" : {
"recovery" : {
"max_bytes_per_sec" : "500mb",
"max_concurrent_file_chunks" : "5"
}

These settings have had different (slower) defaults across Elasticsearch versions.
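Both settings are dynamic, so they can be applied to a running cluster without a restart; a sketch using the values above:

```json
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "500mb",
    "indices.recovery.max_concurrent_file_chunks": 5
  }
}
```

Note that `max_bytes_per_sec` is a per-node cap, so raising it affects every recovery on the node, not just the slow one.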

There are 2 problems here:

  1. Replica created as UNASSIGNED and remains in that state for 5 minutes.
  2. Replica remains in INITIALIZING state for a long time after UNASSIGNED in some cases.

The second case is a consequence of the first: translog accumulates over the 5-minute interval, and applying it takes longer because our indexing rate is high; in my experience, translog replay is slow under a high indexing rate. I am more interested in fixing the first scenario so that the second one is avoided as a result.
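For anyone debugging the same symptom, the unassigned replica can be inspected directly with the allocation explain API (index name and shard number are placeholders):

```json
GET _cluster/allocation/explain
{
  "index": "my-new-index",
  "shard": 0,
  "primary": false
}
```

The response includes a per-node `deciders` section; in our case it reports the throttle decision from the concurrent-recoveries limit.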

We have not adjusted any priorities. I assume the priorities are based on index creation times in that case.

You are right about the problem and why it happens, but there should be a way to start the replicas of a new index even while the ongoing recoveries finish: its primaries are empty, so the replicas would be assigned very quickly, irrespective of the existing recoveries.

Our defaults are already much higher than these values, but it would still take some time for the replicas to be assigned, and we start indexing as soon as the index is rolled over, which leaves a small window in which our data can be lost.
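One way to narrow that window might be to make the rollover itself wait until the replicas of the new index are active, via `wait_for_active_shards` on the rollover request. A sketch, with placeholder alias name and an illustrative condition; note this only blocks the rollover call, it does not change the recovery throttling that delays the replicas in the first place:

```json
POST my-alias/_rollover?wait_for_active_shards=all
{
  "conditions": {
    "max_docs": 100000000
  }
}
```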