Hi team
We recently encountered the same problem: an index with multiple replicas turned red when the node holding the primary went down. Are there any plans to fix it?
It’s on the list to fix still, but it’s not a very high priority right now as we just don’t encounter these tragic failures often enough in practice.
Thanks for the update.
We understand that this issue is currently considered low priority. However, we would like to provide additional operational context to help assess the real-world likelihood and impact of this failure mode.
1. Probability of I/O errors (EC2 local SSD context)
Our Elasticsearch cluster runs on EC2 instances using local NVMe SSDs, which come with an advertised availability of ~99.5%. This implies that, statistically, a single local SSD can experience roughly 0.5% annual unavailability, which corresponds to ~44 hours per year of potential I/O anomalies or interruptions that can realistically lead to I/O errors during replication.
Because this failure mode depends on underlying disk I/O behavior rather than truly rare events, we believe the real-world likelihood is higher than assumed, especially for clusters under heavy write load or intensive replication.
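The conversion from the advertised availability figure to hours of potential downtime can be checked with a quick script (the 99.5% figure is the instance-level availability quoted above; the rest is direct arithmetic, not a measured error rate):

```python
# Convert an advertised availability percentage into potential downtime.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

availability = 0.995          # ~99.5% advertised for a single instance
unavailability = 1 - availability

downtime_hours_per_year = unavailability * HOURS_PER_YEAR
downtime_hours_per_month = downtime_hours_per_year / 12

print(f"{downtime_hours_per_year:.1f} hours/year")    # ~43.8, i.e. the ~44 h cited above
print(f"{downtime_hours_per_month:.2f} hours/month")  # ~3.65
```

This is an upper bound on unavailability, not a prediction of disk I/O errors; whether an outage window actually produces an I/O error during replication depends on workload timing.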
2. Real-world impact severity
When a primary encounters an I/O error during replication, all replicas are marked as stale, even though the write was never acknowledged. This effectively turns a single-node or single-disk I/O issue into a multi-replica invalidation, significantly increasing the duration and blast radius of a failure.
This behavior reduces cluster resilience and contradicts the expectation that unacknowledged writes should not invalidate healthy replicas.
3. Request for clarification on failure likelihood
Could the Elastic team share any insights into:
how often "primary fails due to I/O error during replication" occurs in internal testing or telemetry?
whether this scenario is expected to be extremely rare, or simply underreported?
Given the 99.5% availability profile of standard cloud local SSDs, our operational experience suggests that I/O anomalies are not vanishingly rare.
4. Request to reconsider priority or provide mitigations
Considering the above, could you reconsider increasing the priority of this fix?
If that is not possible, we would greatly appreciate any recommendations on:
configuration changes that could prevent replicas from being incorrectly marked stale, or mitigation strategies to reduce the impact of this failure mode.
Thanks again for your time and support; we really appreciate any further insights you can share.
I have no opinion on the priority of the issue you raised, but wish you luck with that.
Do you have a reference for that availability SLA?
"we believe" is doing some pretty heavy lifting there. I am not at all sure how you could quantify the claim, noting that "underlying disk I/O behavior" is a bit vague.
We are definitely using local SSDs on EC2 in Elastic Cloud Hosted but I do not believe we see anything close to this kind of IO error rate across the fleet. Unfortunately I don’t have access to the exact details of how the storage is configured on these hosts, but I’m pretty sure there’s some level of RAID involved.
Thanks for the earlier replies.
As background, the EC2 instance-level SLA provides 99.5% monthly uptime per single instance. We referenced this only to illustrate that single-node or host-level anomalies do occur in practice. This was not intended to imply any specific media-error rate, only to acknowledge that host-level interruptions are not zero in real-world production environments. The assumption was simply that, in the worst case, any underlying fault could trigger an I/O error, even though such events typically do not occur.
Separately, based on our current understanding of the failure mode: when a disk I/O error occurs on the primary shard, Elasticsearch may close the primary and cancel its ongoing replication operations. Because the replica-side bulk operations are cancelled as a downstream effect of the primary becoming unavailable, the system then incorrectly marks the replica shards as stale as well. This effectively causes both the primary and all replicas to fail, turning the index red.
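To make sure we are describing the same sequence, here is a toy model of our understanding (this is illustrative pseudologic, not Elasticsearch code; the class and method names are ours, and the key assumption is that cancelled replica operations are treated as replica failures):

```python
# Toy model of the reported failure sequence: an I/O error on the primary
# cancels in-flight replica operations, and that cancellation marks each
# replica stale even though the write was never acknowledged.
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    primary_ok: bool = True
    replicas_ok: list = field(default_factory=lambda: [True, True])

    def primary_io_error(self):
        # The primary shard fails and is closed.
        self.primary_ok = False
        # Assumption under discussion: cancellation of in-flight replication
        # ops is treated as a failure of each replica copy.
        self.replicas_ok = [False for _ in self.replicas_ok]

    def health(self):
        # Simplified: red only when no usable copy of the shard remains.
        if self.primary_ok or any(self.replicas_ok):
            return "not-red"
        return "red"


idx = ToyIndex()
idx.primary_io_error()
print(idx.health())  # "red": a single disk fault invalidated every copy
```

If the cancellation did not mark the replicas stale, one of the healthy copies could be promoted and the index would stay out of the red state.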
Could you confirm whether this description matches the current known behavior of the issue?