Checking Elasticsearch health during rolling restart

Hello! When doing a rolling upgrade (https://www.elastic.co/guide/en/elasticsearch/reference/7.3/rolling-upgrades.html), is checking that there are 0 initializing and relocating shards a reliable way to determine if you can restart the next node as opposed to checking that the "status" of the cluster is "green"?

Background

I'm writing a script to do a rolling upgrade on my Elasticsearch 7.3.2 cluster deployed in AWS. For me that means (at a high level):

for each EC2 instance:
  1. terminate the instance
  2. wait for an ASG to spin up an instance to replace the terminated one
  3. wait for ES cluster to be "healthy" before terminating the next node
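
A minimal sketch of that loop in Python, not a definitive implementation. The names `rolling_replace`, `terminate`, `replacement_ready`, and `cluster_healthy` are hypothetical: the callables stand in for whatever AWS and Elasticsearch calls the real script makes, and the poll interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable


def rolling_replace(
    instance_ids: Iterable[str],
    terminate: Callable[[str], None],        # hypothetical: terminates one instance
    replacement_ready: Callable[[], bool],   # hypothetical: True once the ASG replacement is up
    cluster_healthy: Callable[[], bool],     # hypothetical: True once ES is "healthy"
    poll_seconds: float = 10.0,              # assumed default
    timeout_seconds: float = 1800.0,         # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Replace instances one at a time, waiting for health between steps."""

    def wait_for(condition: Callable[[], bool], what: str) -> None:
        # Poll until the condition holds; give up once the deadline has passed.
        deadline = time.monotonic() + timeout_seconds
        while not condition():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"timed out waiting for {what}")
            sleep(poll_seconds)

    for instance_id in instance_ids:
        terminate(instance_id)                                # step 1
        wait_for(replacement_ready, "replacement instance")   # step 2
        wait_for(cluster_healthy, "cluster health")           # step 3
```

Passing the checks in as callables keeps the loop independent of how "healthy" ends up being defined in step 3.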

My question revolves around (3). The rolling upgrade documentation seems to suggest first waiting for the cluster to become "green", but if it doesn't, you can continue the rolling upgrade as long as there are no initializing or relocating shards. Instead of first checking that the cluster is "green", can I just check that there are no initializing or relocating shards? That would make the script logic simpler.

Excerpt from the documentation:

Before upgrading the next node, wait for the cluster to finish shard allocation. You can check progress by submitting a _cat/health request. Wait for the status column to switch from yellow to green. Once the node is green, all primary and replica shards have been allocated.
IMPORTANT:
During a rolling upgrade, primary shards assigned to a node running the new version cannot have their replicas assigned to a node with the old version. The new version might have a different data format that is not understood by the old version. If it is not possible to assign the replica shards to another node (there is only one upgraded node in the cluster), the replica shards remain unassigned and status stays yellow. In this case, you can proceed once there are no initializing or relocating shards (check the init and relo columns). As soon as another node is upgraded, the replicas can be assigned and the status will change to green.
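
One way to read the excerpt above as a single check, sketched in Python under stated assumptions. The function name `safe_to_proceed` and the `expected_nodes` parameter are hypothetical; the dictionary keys are fields of the `GET _cluster/health` JSON response, which reports the same counts as the `init` and `relo` columns of `_cat/health`. The node-count guard is an added assumption, not part of the documentation: if the replacement node has not joined yet, there can be zero initializing and relocating shards simply because no recovery has started.

```python
from typing import Any, Mapping, Optional


def safe_to_proceed(
    health: Mapping[str, Any],
    expected_nodes: Optional[int] = None,
) -> bool:
    """Decide from a parsed `GET _cluster/health` body whether to move on."""
    # Optional guard (an assumption, not from the docs excerpt): the
    # replacement node must have joined before shard counts are meaningful.
    if expected_nodes is not None and health["number_of_nodes"] < expected_nodes:
        return False

    status = health["status"]

    # Green: all primary and replica shards are allocated.
    if status == "green":
        return True

    # Yellow: proceed only when nothing is still initializing or relocating.
    if status == "yellow":
        return (
            health["initializing_shards"] == 0
            and health["relocating_shards"] == 0
        )

    # Red (or anything unexpected): a primary is missing, so do not proceed.
    return False
```

With this shape the script polls one predicate instead of branching separately on green and yellow.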

Thanks for any and all advice!

One issue can arise with the first upgraded node. If the upgrade brings a new Lucene version and a new index happens to get created, I think it is allocated only on the nodes running the newest Lucene version. In my case, Elasticsearch refused to allocate its replica on a node with the older Lucene version, so the cluster never went green. Most of our indices have 1 shard, but I suspect that creating an index with more than one shard would fail.

This happened more frequently on my test cluster with quick rollover; so far, I've never seen it happen on production-like clusters.

Also, we are now upgrading all nodes on a rack at the same time; it's quicker than one node at a time :-).

I do the same: shut down everything and upgrade everything. It's done in less than 15 minutes, tops.
But then, my data is not mission critical.
