Rolling restart triggers primary-replica resync leading to write unavailability

When a node is shut down during a normal rolling restart, we lose write availability for around 10 minutes.

We're using the ECK operator to manage the cluster. It's running Elasticsearch 7.17 and the operator is on the latest version.

  1. The node goes down, prompting a replica promotion.
  2. The replicas are not in sync, which causes a primary-replica resync operation.
  3. During that time, bulk requests time out.
  4. That can last over 10 minutes (a sketch of watching the shard states during one of these events follows this list).
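
A rough sketch of how to see this as it happens (bigidx is the busy index whose settings appear further down; the columns are just a convenient subset):

# which shard copies are STARTED / INITIALIZING / UNASSIGNED, and why
$ curl -s '127.0.0.1:9200/_cat/shards/bigidx?v&h=index,shard,prirep,state,node,unassigned.reason'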

Some background on the cluster:

  • Our requests are bulked together across all indices. This means all requests end up timing out while this one shard is unavailable.
  • We have 5 primary and 2 replica shards for these heavy indices.
  • We index with wait_for_active_shards=1, i.e. we are not waiting for replicas to receive the updates (see the sketch after this list).
  • Each node currently houses ~80 shards and we have 80 nodes. They are not small nodes, but they are on HDDs. In total it's >6,000 shards and >800 indices.
  • One index is always busy (~50,000 requests/s). The rest are not consistently busy, but can be spiky. The primary-replica resyncs are not limited to that very busy index.
  • We can sometimes be processing large delete-by-query operations on those indices.
  • I realised the refresh_interval on these write-heavy indices was abnormally high (60 min). However, we also see these primary-replica resyncs on other indices, so I don't think this is the issue.
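
To illustrate the wait_for_active_shards point above, a sketch of what our bulk requests boil down to (the payload here is illustrative, not our real documents); with wait_for_active_shards=1, the default, only the primary copy has to be active for the request to proceed:

$ curl -s -XPOST '127.0.0.1:9200/_bulk?wait_for_active_shards=1' \
    -H 'Content-Type: application/x-ndjson' -d '
{"index":{"_index":"bigidx"}}
{"some_field":"some_value"}
'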

I have some questions:

  • In this scenario, with wait_for_active_shards=1, it seems we are likely losing data when the node goes away? Is that correct? This talk from 2017 states that writes/replication are synchronous, but if that's the case the meaning of wait_for_active_shards isn't clear to me.
  • Is this replica-resync behaviour expected? It seems abnormally slow, even at the high write rate.
  • I didn't mention it above, but these recoveries can also be excruciatingly slow; some take 4 hours, which really slows down a rolling restart (a sketch of tracking their progress follows).
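
For those long recoveries, a sketch of how progress can be tracked while they run:

$ curl -s '127.0.0.1:9200/_cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent,translog_ops_percent,time'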

Why are the replicas not in sync? Have you any non-default settings that could impact this, e.g. around async replication?

Given that you have a high indexing rate, index into a lot of shards and use HDDs, is it possible you are overloading the I/O capacity of some nodes causing them to fall behind?

Why are the replicas not in sync?

I assumed they aren't in sync because a primary-replica resync occurs. Nothing is logged about being out of sync, and looking at _cluster/state at any given time shows all shards listed in in_sync_allocations. Note that we don't have information about how out of date anything is. I tried "logger.org.elasticsearch.index.seqno.ReplicationTracker": "TRACE" but it was too spammy.
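
For completeness, a sketch of the in_sync_allocations check described above (filter_path just trims the cluster state down to the in-sync allocation IDs):

$ curl -s '127.0.0.1:9200/_cluster/state/metadata/bigidx?filter_path=metadata.indices.*.in_sync_allocations'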

Have you any non-default settings that could impact this, e.g. around async replication?

To save time, here are the non-default settings we have that seem in any way relevant:

$ curl -s -XGET '127.0.0.1:9200/_cluster/settings?flat_settings=true'
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "20",
    "cluster.routing.allocation.node_concurrent_recoveries": "10",
    "cluster.routing.allocation.node_initial_primaries_recoveries": "10",
    "indices.recovery.max_bytes_per_sec": "250mb",
    "indices.recovery.max_concurrent_file_chunks": "5"
  }
}

$ curl -s -XGET '127.0.0.1:9200/bigidx/_settings?flat_settings=true'
{
  "bigidx": {
    "settings": {
      "index.codec": "best_compression",
      "index.number_of_replicas": "2",
      "index.number_of_shards": "5",
      "index.refresh_interval": "60s", # NOTE: used to be 3600s - haven't tried RollingRestart with 60s.
      "index.translog.durability": "async",
      "index.unassigned.node_left.delayed_timeout": "15m"
    }
  }
}

The settings that seem most likely to impact this are:

  • index.translog.durability: this clearly affects the durability of data in the translog. However, I don't understand how it would play into a primary-replica resync, because the secondary/tertiary shards are not going down.

  • index.refresh_interval: clearly 3600s causes far fewer refreshes/flushes to occur, and I guess this affects how often replicas are forced back in sync. But given we saw this occur on indices with the default 1s interval, I'm not sure it is the sole contributor (a sketch of reverting these index settings for a test follows this list).

  • The recovery settings are an attempt to mitigate the very slow recoveries. Is it possible they are allowing recovery traffic to throttle the node during the primary-replica resync operation?
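
If we wanted to rule the index-level settings out, a sketch of reverting them before the next restart (both are dynamic; setting a value to null restores the default):

$ curl -s -XPUT '127.0.0.1:9200/bigidx/_settings' -H 'Content-Type: application/json' -d '
{
  "index.refresh_interval": null,
  "index.translog.durability": null
}'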

Specific breakdown of what happens

Here's an example of this event happening in more detail. Default logging generally didn't provide enough information to debug, but we increased verbosity and got more detail. Here's the breakdown from when this last occurred:

  • 12:52:?? - datanode-27 is scheduled for deletion (with the shutdown API; a sketch of this call follows the timeline).

  • 12:52:09 - datanode-7 detects the new primary, primary term 35. This is when bulk requests start to time out:

[bigidx][1] detected new primary with primary term [35], global checkpoint [36301557005], max_seq_no [36301561545]
  • 12:52:24 - The pod in k8s actually gets deleted. We see datanode-27 log some errors on bulk requests: "node closed while execution action". These use the old primary term 34.

  • 12:57:58 - The primary-replica resync completes; datanode-27 logs this. At this point, the lag is recovered:

{"level": "INFO", "component": "o.e.i.s.IndexShard", "cluster.name": "eck", "node.name": "eck-es-datanodes-2-36", "message": " ==[bigidx][1]== primary-replica resync completed with 4540 operations", "cluster.uuid": "XXX-XXX", "node.id": "XXX" }
  • 13:10 - datanode-27 starts [bigidx][1] as a replica after completing a recovery.
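
For context, the shutdown call behind "scheduled for deletion" looks roughly like this (a sketch; <node-id> is a placeholder for datanode-27's node ID, and the exact type the operator uses may differ):

$ curl -s -XPUT '127.0.0.1:9200/_nodes/<node-id>/shutdown' -H 'Content-Type: application/json' -d '
{"type": "restart", "reason": "rolling restart"}'

# and to check its status afterwards:
$ curl -s '127.0.0.1:9200/_nodes/<node-id>/shutdown'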

Given that you have a high indexing rate, index into a lot of shards and use HDDs, is it possible you are overloading the I/O capacity of some nodes causing them to fall behind?

I will try to see if we have historical metrics of use to check here. We are already planning on moving write heavy indices to SSDs but I'm trying to determine if there might be something else at fault here.

I think these settings explain what you're seeing. HDDs are terrible at handling concurrent workloads, and yet with these settings you are telling Elasticsearch that it's ok to do an enormous amount of concurrent I/O for recoveries on top of your regular write traffic. You need to set them back to their defaults so ES has at least a hope of keeping up with write traffic while recovering shards.
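
For reference, a sketch of resetting them; putting null for a key restores its default:

$ curl -s -XPUT '127.0.0.1:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "cluster.routing.allocation.node_initial_primaries_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null,
    "indices.recovery.max_concurrent_file_chunks": null
  }
}'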

Okay, we'll remove these and see if that helps. I'll update later next week at some point. Thanks for looking.

This behaviour occurred again during a rolling restart, even with the default recovery settings. On the upside, it seems to happen less frequently now.

Before, we saw it on roughly 1 in 5 node restarts; this time we only hit it after about 40 node restarts.
However, we still saw significant impact: 25 minutes of write unavailability.

Are there any node-level statistics that might help in understanding what is causing the bottleneck here? Or is it just a case of HDDs failing to keep up with a certain ingest load?
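
One thing we could start collecting, in case it's useful, is the per-node disk counters (on Linux these come from /proc/diskstats):

$ curl -s '127.0.0.1:9200/_nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.io_stats.total'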

It depends on the exact nature of the "write unavailability". Can you share a response from ES during this time showing the error it's returning?

On the client side we just see timeouts. We can raise the timeouts as high as we'd like, but no requests to the _bulk endpoint return until after the resync completes. The log we see is something like:

{
  "type": "server",
  "timestamp": "2023-08-14T19:30:50,181Z",
  "level": "INFO",
  "component": "o.e.i.s.IndexShard",
  "cluster.name": "eck",
  "node.name": "eck-es-datanodes-2-64",
  "message": " [<index_name>][2] primary-replica resync completed with 33 operations",
}

The node itself doesn't look like it's doing anything interesting. A graph of thread pool operations (visualised in Grafana via elasticsearch-exporter) shows the operations this data node is doing, and we can see nothing happening until that index comes back online.

What ways do people usually use to look at I/O performance in Elasticsearch?
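
On the host side, the obvious starting point is extended iostat output on the data nodes (assuming sysstat is available there):

# per-device utilisation, queue size and await, refreshed every 5s
$ iostat -x 5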

I expect this isn't quite true, e.g. if you set the timeout to 60 minutes then you should get a response eventually. You should be able to carry on sending operations to ES until it starts sending 429s back: even if it is not responding very quickly, it will batch up all the pending ops to amortize the IO costs, which is probably helpful when using HDDs. Timing out on write ops is not a great idea, especially if you then just retry them, because ES will keep the original op in its queue and will eventually process it, so if you put retries in its queue too then you're just adding extra unnecessary work.

However, it's useful to know that ES isn't actually responding with errors, it's just busy. In that case, I would look at the threadpool stats (particularly the number of active threads and the queue length) and the hot threads API (with ?threads=9999) to see the detail of what it's doing/waiting on. If you capture these things next time then we'll help you interpret what you're seeing and maybe pin down a bug you're hitting.
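
To make that concrete, a sketch of what could be captured during the next event:

$ curl -s '127.0.0.1:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected,completed'
$ curl -s '127.0.0.1:9200/_nodes/hot_threads?threads=9999' > hot_threads.txt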

Timing out on write ops is not a great idea, especially if you then just retry them, because ES will keep the original op in its queue and will eventually process it, so if you put retries in its queue too then you're just adding extra unnecessary work.

Do people often run bulk indexing APIs with a very high or zero timeout? It seems unusual.

hot threads API (with ?threads=9999 ) to see the detail of what it's doing/waiting on. If you capture these things next time then we'll help you interpret what you're seeing and maybe pin down a bug you're hitting.

I'll try and capture this.

I think the unusual thing here is running such a sizeable ES cluster backed by HDDs. Normally the write latency is much lower than all reasonable request timeouts so the question doesn't really arise. But yeah timeout-and-retry is very much a bad idea.
