Hi team
We recently encountered the same problem: an index with multiple replicas turned red when the node holding the primary went down. Are there any plans to fix it?
It’s on the list to fix still, but it’s not a very high priority right now as we just don’t encounter these tragic failures often enough in practice.
Thanks for the update.
We understand that this issue is currently considered low priority. However, we would like to provide additional operational context to help assess the real-world likelihood and impact of this failure mode.
1. Probability of I/O errors (EC2 local SSD context)
Our Elasticsearch cluster runs on EC2 instances using local NVMe SSDs, which come with an advertised availability of ~99.5%. This implies that, statistically, a single local SSD can experience roughly 0.5% annual unavailability, which corresponds to ~44 hours per year of potential I/O anomalies or interruptions that can realistically lead to I/O errors during replication.
Because this failure mode depends on underlying disk I/O behavior rather than truly rare events, we believe the real-world likelihood is higher than assumed, especially for clusters under heavy write load or intensive replication.
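The conversion from the advertised availability figure to hours of potential downtime can be checked with a quick script (the 99.5% figure is the instance-level availability quoted above; the rest is direct arithmetic, not a measured error rate):

```python
# Convert an advertised availability percentage into potential downtime.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

availability = 0.995          # ~99.5% advertised for a single instance
unavailability = 1 - availability

downtime_hours_per_year = unavailability * HOURS_PER_YEAR
downtime_hours_per_month = downtime_hours_per_year / 12

print(f"{downtime_hours_per_year:.1f} hours/year")    # ~43.8, i.e. the ~44 h cited above
print(f"{downtime_hours_per_month:.2f} hours/month")  # ~3.65
```

This is an upper bound on unavailability, not a prediction of disk I/O errors; whether an outage window actually produces an I/O error during replication depends on workload timing.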
2. Real-world impact severity
When a primary encounters an I/O error during replication, all replicas are marked as stale, even though the write was never acknowledged. This effectively turns a single-node or single-disk I/O issue into a multi-replica invalidation, significantly increasing the duration and blast radius of a failure.
This behavior reduces cluster resilience and contradicts the expectation that unacknowledged writes should not invalidate healthy replicas.
3. Request for clarification on failure likelihood
Could the Elastic team share any insights into:
how often "primary fails due to I/O error during replication" occurs in internal testing or telemetry?
whether this scenario is expected to be extremely rare, or simply underreported?
Given the 99.5% availability profile of standard cloud local SSDs, our operational experience suggests that I/O anomalies are not vanishingly rare.
4. Request to reconsider priority or provide mitigations
Considering the above, could you reconsider increasing the priority of this fix?
If that is not possible, we would greatly appreciate any recommendations on:
configuration changes that could prevent replicas from being incorrectly marked stale, or mitigation strategies to reduce the impact of this failure mode.
Thanks again for your time and support; we really appreciate any further insights you can share.
I have no opinion on the priority of the issue you raised, but wish you luck with that.
Do you have a reference for that availability SLA?
"we believe" is doing some pretty heavy lifting there. I am not at all sure how you could quantify the claim, noting that "underlying disk I/O behavior" is a bit vague.
We are definitely using local SSDs on EC2 in Elastic Cloud Hosted but I do not believe we see anything close to this kind of IO error rate across the fleet. Unfortunately I don’t have access to the exact details of how the storage is configured on these hosts, but I’m pretty sure there’s some level of RAID involved.
Thanks for the earlier replies.
As background, the EC2 instance-level SLA provides 99.5% monthly uptime per single instance. We referenced this only to illustrate that single-node or host-level anomalies do occur in practice. This was not intended to imply any specific media-error rate, only to acknowledge that host-level interruptions are not zero in real-world production environments. The assumption was simply that, in the worst case, any underlying fault could trigger an I/O error, even though such events typically do not occur.
Separately, based on our current understanding of the failure mode: when a disk I/O error occurs on the primary shard, Elasticsearch may close the primary and cancel its ongoing replication operations. Because the replica-side bulk operations are cancelled as a downstream effect of the primary becoming unavailable, the system then incorrectly marks the replica shards as stale as well. This effectively causes both the primary and all replicas to fail, turning the index red.
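To make sure we are describing the same sequence, here is a toy model of our understanding (this is illustrative pseudologic, not Elasticsearch code; the class and method names are ours, and the key assumption is that cancelled replica operations are treated as replica failures):

```python
# Toy model of the reported failure sequence: an I/O error on the primary
# cancels in-flight replica operations, and that cancellation marks each
# replica stale even though the write was never acknowledged.
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    primary_ok: bool = True
    replicas_ok: list = field(default_factory=lambda: [True, True])

    def primary_io_error(self):
        # The primary shard fails and is closed.
        self.primary_ok = False
        # Assumption under discussion: cancellation of in-flight replication
        # ops is treated as a failure of each replica copy.
        self.replicas_ok = [False for _ in self.replicas_ok]

    def health(self):
        # Simplified: red only when no usable copy of the shard remains.
        if self.primary_ok or any(self.replicas_ok):
            return "not-red"
        return "red"


idx = ToyIndex()
idx.primary_io_error()
print(idx.health())  # "red": a single disk fault invalidated every copy
```

If the cancellation did not mark the replicas stale, one of the healthy copies could be promoted and the index would stay out of the red state.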
Could you confirm whether this description matches the current known behavior of the issue?