[BUG] A replica shard in POST_RECOVERY state, when promoted to primary, will be stuck

Govind_Balaji_S · September 30, 2025, 9:16pm

EDIT: I think I mixed up ShardRoutingState and IndexShardState.

I think this PR introduced a bug: "A replica can be promoted and started in one cluster state update" by bleskes (Pull Request #32042, elastic/elasticsearch).

This commit may miss that currentRouting.initializing() does not cover IndexShardState.POST_RECOVERY. The change was intended to fix this kind of bug for the general case, but this particular case looks like it was dropped in the refactor.
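To keep the two state machines apart (the mix-up mentioned in the EDIT above), here is a toy model in Java. It is only a sketch, not the Elasticsearch/OpenSearch code: the enum constants follow ShardRoutingState and IndexShardState, but `RoutingState`, `LocalState`, `routingInitializing` and `describe` are names made up for illustration. It also assumes that a shard in POST_RECOVERY is still listed as INITIALIZING in the routing table until the master applies the shard-started update.

```java
public class ShardStateSketch {
    // Cluster-level view, assigned by the master (names follow ShardRoutingState).
    enum RoutingState { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }

    // Node-local lifecycle of the shard (names follow IndexShardState).
    enum LocalState { CREATED, RECOVERING, POST_RECOVERY, STARTED, CLOSED }

    // Hypothetical stand-in for a routing-level initializing() check:
    // it only inspects the routing state, never the node-local state.
    static boolean routingInitializing(RoutingState routing) {
        return routing == RoutingState.INITIALIZING;
    }

    // Formats one (routing, local) combination together with the check result.
    static String describe(RoutingState routing, LocalState local) {
        return "routing=" + routing + " local=" + local
            + " routingInitializing=" + routingInitializing(routing);
    }

    public static void main(String[] args) {
        // Assumption of this sketch: both local states below coexist with an
        // INITIALIZING routing entry, so the routing-level check cannot tell them apart.
        System.out.println(describe(RoutingState.INITIALIZING, LocalState.RECOVERING));
        System.out.println(describe(RoutingState.INITIALIZING, LocalState.POST_RECOVERY));
        System.out.println(describe(RoutingState.STARTED, LocalState.STARTED));
    }
}
```

Under that assumption the routing-level check gives the same answer for RECOVERING and POST_RECOVERY, so any logic that needs to treat POST_RECOVERY specially has to look at the node-local state as well.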

Steps to reproduce

(Non-deterministic)

After restarting a 4-node cluster (the index of interest is a single-shard index with replicationFactor = 2), we saw all replicas of that shard stuck in "INITIALIZING", while the primary shard was in the "STARTED" state.

Indexing kept failing with

{"type":"retry_on_primary_exception","reason":"shard is not in primary mode","index":"***","shard":"0","index_uuid":"***","caused_by":{"type":"shard_not_in_primary_mode_exception","reason":"CurrentState[STARTED] shard is not in primary mode","index":"***","shard":"0","index_uuid":"***"}}

It looked as if the indexShard's shardRouting.primary was set to true, but replicationTracker.primary was not. From tracing the code, could this be the bug?
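The suspected mismatch can be modelled in a few lines of Java. This is a toy model of the hypothesis only, not OpenSearch code: `ToyShard`, `promoteAndStart`, `index` and `NotInPrimaryModeException` are made-up names, and the `activatePrimaryMode` parameter exists purely so the sketch can show both outcomes.

```java
public class PrimaryModeSketch {
    // Hypothetical stand-in for the exception reported in the error above.
    static class NotInPrimaryModeException extends RuntimeException {
        NotInPrimaryModeException(String state) {
            super("CurrentState[" + state + "] shard is not in primary mode");
        }
    }

    static class ToyShard {
        boolean routingPrimary;          // models shardRouting.primary
        boolean trackerPrimary;          // models replicationTracker.primary
        String state = "POST_RECOVERY";  // node-local shard state

        // Models one cluster state update that promotes and starts the shard.
        // activatePrimaryMode is a knob of this sketch, not a real parameter.
        void promoteAndStart(boolean activatePrimaryMode) {
            routingPrimary = true;
            state = "STARTED";
            if (activatePrimaryMode) {
                trackerPrimary = true;
            }
        }

        // In this model, indexing is gated on the tracker flag, not the routing flag.
        String index(String doc) {
            if (!trackerPrimary) {
                throw new NotInPrimaryModeException(state);
            }
            return "indexed:" + doc;
        }
    }

    public static void main(String[] args) {
        ToyShard stuck = new ToyShard();
        stuck.promoteAndStart(false);   // routing says primary, tracker does not
        try {
            stuck.index("doc-1");
        } catch (NotInPrimaryModeException e) {
            System.out.println(e.getMessage());
        }

        ToyShard healthy = new ToyShard();
        healthy.promoteAndStart(true);  // both flags agree
        System.out.println(healthy.index("doc-1"));
    }
}
```

If the two flags can diverge like this, every indexing attempt fails with the same message as in the error above, even though the routing table shows a STARTED primary.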

Version: OpenSearch 2.19.1 (this code path is still in Elasticsearch master too)

system · September 30, 2025, 9:16pm

OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance. See "What is OpenSearch and the OpenSearch Dashboard?" on the Elastic site for more details.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)
