EDIT: I think I mixed up ShardRoutingState and IndexShardState.
I think this PR introduced a bug: "A replica can be promoted and started in one cluster state update" by bleskes (Pull Request #32042, elastic/elasticsearch).
The commit may miss that currentRouting.initializing() does not cover IndexShardState.POST_RECOVERY. It was intended to fix this kind of bug for more general cases like this one, but this particular case may have been dropped in the refactor.
Steps to reproduce
(Non-deterministic)
After restarting a 4-node cluster (the index of interest is a single-shard index with replicationFactor = 2),
we saw all replicas of that shard stuck in "INITIALIZING", while the primary shard was in the "STARTED" state.
Indexing kept failing with:
{"type":"retry_on_primary_exception","reason":"shard is not in primary mode","index":"***","shard":"0","index_uuid":"***","caused_by":{"type":"shard_not_in_primary_mode_exception","reason":"CurrentState[STARTED] shard is not in primary mode","index":"***","shard":"0","index_uuid":"***"}
It appeared that the IndexShard's shardRouting.primary was set to true, but replicationTracker.primary was not. From tracing the code, could this be the bug?
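To make the suspected inconsistency concrete, here is a minimal toy model in Java. It is not the OpenSearch/Elasticsearch code: ToyShard, routingPrimary, trackerPrimaryMode and checkWrite are hypothetical names, and only the enum values mirror the real ShardRoutingState and IndexShardState. It only shows how a shard whose routing says "primary" but whose tracker never entered primary mode would reject writes with the message we saw.

```java
public class ShardStateSketch {
    // Same values as the real ShardRoutingState enum (cluster-level routing view).
    enum ShardRoutingState { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }

    // Same values as the real IndexShardState enum (node-local shard lifecycle).
    enum IndexShardState { CREATED, RECOVERING, POST_RECOVERY, STARTED, CLOSED }

    // Hypothetical stand-in for an IndexShard; the fields are assumptions for illustration.
    static final class ToyShard {
        final boolean routingPrimary;      // models shardRouting.primary
        final boolean trackerPrimaryMode;  // models replicationTracker's primary-mode flag
        final IndexShardState state;       // node-local state, independent of routing state

        ToyShard(boolean routingPrimary, boolean trackerPrimaryMode, IndexShardState state) {
            this.routingPrimary = routingPrimary;
            this.trackerPrimaryMode = trackerPrimaryMode;
            this.state = state;
        }

        // Returns "ok" if a write would be accepted, otherwise the rejection reason.
        String checkWrite() {
            if (routingPrimary && !trackerPrimaryMode) {
                return "CurrentState[" + state + "] shard is not in primary mode";
            }
            return "ok";
        }
    }

    public static void main(String[] args) {
        // The inconsistent combination described above: routing says primary, tracker does not.
        ToyShard stuck = new ToyShard(true, false, IndexShardState.STARTED);
        System.out.println(stuck.checkWrite());

        // A consistent primary accepts the write.
        ToyShard healthy = new ToyShard(true, true, IndexShardState.STARTED);
        System.out.println(healthy.checkWrite());
    }
}
```

The point of keeping the two enums separate in the sketch is that a check on the routing state (for example INITIALIZING) says nothing by itself about the node-local state (for example POST_RECOVERY), which is where I suspect the primary-mode activation gets skipped.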
Version: OpenSearch 2.19.1 (this code path still exists in Elasticsearch master too)