When a node is shut down during a normal rolling restart, we lose write availability for around 10 minutes.
We're using the ECK operator to manage the cluster. Elasticsearch is on 7.17 and the operator is on the latest version.
- The node goes down, prompting a replica promotion.
- The replicas are not in-sync, which causes a primary-replica resync operation.
- During that time, bulk requests time out.
- That can last over 10 mins.
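For anyone reproducing this, the resync should be visible through the standard recovery and health APIs while it's in flight (a minimal sketch; the column selection is just illustrative):

```
# Watch in-flight peer recoveries / resyncs
GET _cat/recovery?v&active_only=true&h=index,shard,stage,type,bytes_percent,translog_ops_percent

# Per-index view of initializing/unassigned shards
GET _cluster/health?level=indices
```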
Some background on the cluster:
- Our requests are bulked together across all indices. This means all requests end up timing out while this one shard is unavailable.
- We have 5 primary and 2 replica shards for these heavy indices.
- We index with `wait_for_active_shards=1`. That means we are not waiting for replicas to receive the updates (see the sketch after this list).
- Each node currently houses ~80 shards and we have 80 nodes. They are not small nodes, but they're on HDDs. In total it's >6,000 shards and >800 indices.
- One index is always busy (~50,000 reqs/s). The rest are not consistently busy, but can be spiky. It's not just the very busy index where we see primary-replica resyncs happening.
- We can sometimes be processing large delete-by-query operations on those indices.
- I realised the refresh_interval on these write-heavy indices was abnormally high (60m). However, we do see these primary-replica resyncs on other indices, so I don't think this is the root cause.
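To make the write path concrete, this is roughly the shape of our requests and the setting I mentioned (a sketch; `heavy-index` and the document body are placeholders, not our real names):

```
# Bulk indexing; wait_for_active_shards=1 means only one active copy
# (the primary) is required for the request to proceed
POST /heavy-index/_bulk?wait_for_active_shards=1
{"index":{}}
{"field":"value"}

# The refresh_interval that turned out to be 60m on the heavy indices
GET /heavy-index/_settings?filter_path=*.settings.index.refresh_interval
```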
I have some questions:
- In this scenario, with `wait_for_active_shards=1`, it seems we are likely losing data when the node goes away? Is that correct? This talk from 2017 states that writes/replication are synchronous, but if that's the case the meaning of `wait_for_active_shards` isn't clear to me.
- Is this replica-resync behaviour expected? It seems abnormally slow, even at the high write rate.
- I didn't mention it above, but these recoveries can also be excruciatingly slow. Some take 4 hours, which really draws out a rolling restart.
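For the recovery-speed question, these are the throttles I'd assume are relevant (a sketch; the `100mb` value is purely illustrative, and the 40mb default is from the 7.x docs rather than measured on our cluster):

```
# Inspect the settings that bound peer-recovery speed
GET _cluster/settings?include_defaults=true&filter_path=*.indices.recovery.max_bytes_per_sec,*.cluster.routing.allocation.node_concurrent_recoveries

# Raising the per-node recovery bandwidth cap (default 40mb),
# which may or may not be safe on HDD-backed nodes
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}
```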