During a rolling restart, all replicas of a single shard sometimes go into PRIMARY_FAILED

I have an issue where, during a rolling restart, once the restart reaches a node that holds a primary shard and that node goes offline, the replica shards sometimes go into a PRIMARY_FAILED state.


my-index               11 p UNASSIGNED NODE_LEFT 
my-index               11 r UNASSIGNED PRIMARY_FAILED
my-index               11 r UNASSIGNED PRIMARY_FAILED

This doesn't seem to happen every time, and I haven't found a way to reproduce it consistently.

According to the documentation, this means: "The shard was initializing as a replica, but the primary shard failed before the initialization completed."

How do I prevent this? I am restarting one node at a time and waiting for all shards to be allocated and the cluster to reach a green state before moving on to the next node. Shard allocation is disabled before each node is taken down and re-enabled once it is back online.

I can't really find any documentation on how to prevent this, and I am following all the steps in Full-cluster restart and rolling restart | Elasticsearch Guide [8.1] | Elastic.

So I am not really sure what is causing this. Any ideas or insights would be great!

Welcome to our community! :smiley:

What do the logs on your master node show at this time for that index?

Unfortunately not much.

I see shard allocation being disabled, the node leaving the cluster, logs about marking unavailable shards as stale (posted below), then a bit later the node rejoining the cluster and allocation being re-enabled.

{"type": "server", "timestamp": "2022-03-09T22:40:52,226Z", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "my-cluster", "node.name": "my-cluster-es-master-4", "message": "[my-index][7] marking unavailable shards as stale: [cyzHRssCRd-PJ8FYF9zAGQ]", "cluster.uuid": "nVZb27XkRkmc5vsbGxqfng", "node.id": "jZz5usD4SSKR4hwrJRJCtw"  }
{"type": "server", "timestamp": "2022-03-09T22:40:53,107Z", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "my-cluster", "node.name": "my-cluster-es-master-4", "message": "[my-index][2] marking unavailable shards as stale: [iRQsE2dkQ2qgoKJup7PmFw]", "cluster.uuid": "nVZb27XkRkmc5vsbGxqfng", "node.id": "jZz5usD4SSKR4hwrJRJCtw"  }
{"type": "server", "timestamp": "2022-03-09T22:40:53,456Z", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "my-cluster", "node.name": "my-cluster-es-master-4", "message": "[my-index][6] marking unavailable shards as stale: [IABho5n9TJqOdYFy_K90Yw]", "cluster.uuid": "nVZb27XkRkmc5vsbGxqfng", "node.id": "jZz5usD4SSKR4hwrJRJCtw"  }
{"type": "server", "timestamp": "2022-03-09T22:40:53,491Z", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "my-cluster", "node.name": "my-cluster-es-master-4", "message": "[my-index][11] marking unavailable shards as stale: [ChD9fRVhSyeh7EuGT_N_Fg]", "cluster.uuid": "nVZb27XkRkmc5vsbGxqfng", "node.id": "jZz5usD4SSKR4hwrJRJCtw"  }

Nothing else in the logs on any of my master nodes. I'm using the default, out-of-the-box logging setup.

@warkolm Any ideas on what to check next?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.