We are facing an issue with the store volume hardware on individual data nodes which results in a full production outage. While the problem clearly lies at the system/hardware level, there might be some Elasticsearch-level options to mitigate it.
Elasticsearch data nodes use a separately mounted volume as the path.data location to store all the data. Approximately once a month, the store volume of one of the data nodes experiences a hardware issue that results in a forceful shutdown of the XFS filesystem, after which the volume is no longer mounted. The mount point directory, however, still exists on the root volume. After the system restart that follows the store volume error (we can't reproduce this, though), Elasticsearch effectively starts writing its data to the root volume. The root volume is quite small, so within a few minutes Elasticsearch fills up the disk. After breaching the flood_stage watermark, Elasticsearch marks indices as read-only, which results in a service outage.
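For reference, on 6.8 the flood-stage read-only block is not lifted automatically once space is freed (automatic release only arrived in 7.4), so part of our recovery is always this request (endpoint address is our assumption, adjust as needed):

```
# Remove the read_only_allow_delete block from all indices after freeing disk
# space; required on 6.8, whereas 7.4+ releases the block automatically.
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
```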
The timing (~5-10 minutes) and unpredictability of this event leave no room for proactive manual intervention to avoid the production outage.
We'd like to automate the mitigation to ensure that whenever a given data node loses its data store volume, it is automatically "excluded" from the cluster.
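One direction we are considering (a sketch, not an official Elasticsearch mechanism): a small watchdog on each data node that checks whether path.data is still a mount point and stops the Elasticsearch service if it is not, so the node leaves the cluster instead of writing to the root volume. The mount path and service name below are assumptions for illustration:

```python
import os
import subprocess

# Hypothetical locations -- adjust to your deployment.
DATA_MOUNT = "/var/lib/elasticsearch"   # path.data mount point (assumption)
SERVICE = "elasticsearch.service"

def store_volume_missing(path: str) -> bool:
    """Return True when the directory exists but is no longer a mount point,
    i.e. writes would silently land on the root volume."""
    return os.path.isdir(path) and not os.path.ismount(path)

def check_and_stop() -> None:
    """Stop Elasticsearch if the store volume has disappeared.

    Intended to be invoked every minute from a cron job or systemd timer;
    stopping the node lets the cluster reallocate its shards instead of
    letting the node fill up the root volume.
    """
    if store_volume_missing(DATA_MOUNT):
        subprocess.run(["systemctl", "stop", SERVICE], check=False)
```

The check is deliberately conservative: if the mount point directory itself is gone, the node cannot write anywhere and Elasticsearch will fail on its own, so the watchdog only acts in the dangerous "directory present but unmounted" state.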
We figured that we could prevent the Elasticsearch process from starting when the store volume is not mounted. But it is unclear whether a system reboot happens consistently on store volume failure or not, and hence we can't rely on Elasticsearch being restarted in that event.
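For the cases where the node does reboot, a systemd drop-in could keep Elasticsearch from starting at all while the store volume is absent, via the RequiresMountsFor directive. A sketch, assuming path.data lives at /var/lib/elasticsearch (paths are ours to adjust):

```
# /etc/systemd/system/elasticsearch.service.d/require-data-mount.conf
[Unit]
# Elasticsearch will not start unless this path is successfully mounted;
# systemd derives a hard dependency on the corresponding mount unit.
RequiresMountsFor=/var/lib/elasticsearch
```

This covers only the restart path, though, which is why it doesn't solve the problem on its own.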
We are investigating whether fs.xfs.panic_mask could help force a system reboot on a store volume issue, but this setting is intended for debugging purposes only, so it is unclear how safe it would be to use in production.
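What we are experimenting with looks roughly like the fragment below; the bitmask value is an assumption on our side and should be checked against the kernel documentation for the individual XFS error tags before any production use. Combined with kernel.panic, a panic would turn into an automatic reboot:

```
# /etc/sysctl.d/90-xfs-panic.conf  (experimental, debug-oriented settings)
# Bitmask of XFS error classes that should panic the kernel instead of only
# shutting down the filesystem; verify the bits against the kernel's
# Documentation/admin-guide/sysctl/fs.rst before enabling.
fs.xfs.panic_mask = 127
# Reboot automatically N seconds after a kernel panic (0 = hang forever).
kernel.panic = 10
```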
Running on EC2, Amazon Linux 2.
ES version 6.8.3 (7.9.3 upgrade pending soon).
Data nodes use the m5d.4xlarge EC2 instance type, with two 300GB NVMe volumes mounted as RAID0.
- Did anyone else experience a similar issue?
- Are there any Elasticsearch-level options/ways to ensure the Elasticsearch process stops working in case of a store-volume-level failure?
Any info/ideas/suggestions more than welcome.