Handling unmounted data volume

Hello!

We are facing an issue with the store volume hardware on a given data node which results in a full production outage. While the problem clearly lies at the system/hardware level, there might be some Elasticsearch-level options to mitigate it.

Problem description

Elasticsearch data nodes use a separately mounted volume as the path.data location to store all the data. Approximately once a month, the store volume on one of the data nodes experiences a hardware issue, which results in a forceful shutdown of the XFS filesystem, after which the volume is no longer mounted.
The mount point directory, however, still exists on the root volume. After the system restart that follows the store volume error (we can't reproduce this, though), Elasticsearch effectively starts using the root volume for its data. The root volume is quite small, so within a few minutes Elasticsearch fills up the disk. After breaching the flood_stage watermark, Elasticsearch marks indices as read-only, which results in a service outage.

The timing (~5-10 minutes) and unpredictability of this event leave no room for proactive manual intervention to avoid the production outage.
We'd like to automate the mitigation to ensure that whenever a data node loses its store volume, it is automatically "excluded" from the cluster.
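One way we are considering automating this (a minimal sketch; the path, service name, and scheduling mechanism below are assumptions, not taken from our actual setup) is a small watchdog that stops Elasticsearch as soon as path.data is no longer a mount point:

```shell
#!/bin/sh
# Watchdog sketch: stop Elasticsearch if the store volume has dropped off.
# Assumptions: path.data lives on a dedicated mount at /var/lib/elasticsearch
# and the service unit is named elasticsearch.service. Could be run from cron
# or a systemd timer every minute.
DATA_MOUNT="${DATA_MOUNT:-/var/lib/elasticsearch}"

store_volume_present() {
  # mountpoint(8) exits non-zero when the directory is not a mount point,
  # i.e. when the store volume is gone and we are looking at the bare
  # directory on the root volume.
  mountpoint -q "$1"
}

# Example invocation (commented out so the sketch itself is side-effect free):
# store_volume_present "$DATA_MOUNT" || systemctl stop elasticsearch.service
```

This only narrows the window to the polling interval, so it complements rather than replaces a startup-time check.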

We figured that we could prevent the Elasticsearch process from starting by using the systemd AssertPathIsMountPoint condition.
But it is unclear whether a system reboot happens consistently on store volume failure, so we can't rely on Elasticsearch being restarted in that event.
We are also investigating whether fs.xfs.panic_mask could help force a system reboot on a store volume issue, but this setting is intended for debugging purposes only, so it is unclear how safe it would be to use in production.
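For reference, the AssertPathIsMountPoint idea could be expressed as a systemd drop-in along these lines (a sketch; the mount path is an assumption, adjust to the real path.data location). RequiresMountsFor= may also be worth adding: it pulls in and orders the service after the corresponding mount unit, and should stop Elasticsearch when systemd sees the mount unit go away (exact propagation semantics vary by systemd version, so this would need testing):

```ini
# /etc/systemd/system/elasticsearch.service.d/store-volume.conf
[Unit]
# Refuse to start unless the store volume is actually mounted here
# (path is an assumption - use the real path.data location).
AssertPathIsMountPoint=/var/lib/elasticsearch
# Require and order after the corresponding .mount unit.
RequiresMountsFor=/var/lib/elasticsearch
```

Applied with `systemctl daemon-reload` followed by a service restart.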

Cluster information

Running on EC2, Amazon Linux 2.
ES version 6.8.3 (an upgrade to 7.9.3 is pending).
Data nodes are using m5d.4xlarge EC2 instance type, with two 300GB NVMe volumes mounted as a RAID0.

Questions

  • Did anyone else experience a similar issue?
  • Are there any Elasticsearch-level options/ways to ensure the Elasticsearch process stops working in case of a store-volume-level failure?

Any info/ideas/suggestions more than welcome.

Thank you.

Yes, as of 7.9.0 (https://github.com/elastic/elasticsearch/pull/52680) a node will remove itself from the cluster if its filesystem goes read-only. Making the filesystem go read-only when it encounters an error is up to you. ext* filesystems have a mount option errors=remount-ro, not sure about XFS. Unmounting the filesystem on an error sounds like a very bad idea; marking it as read-only and letting it return errors to the application is much safer and is the expected behaviour.
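For the ext* case, a minimal /etc/fstab sketch (the UUID and mount point are placeholders, not your actual values):

```
# /etc/fstab - remount the data filesystem read-only on any error (ext4)
UUID=<your-volume-uuid>  /var/lib/elasticsearch  ext4  defaults,errors=remount-ro  0  2
```

Combined with the 7.9+ behaviour above, a read-only data path then causes the node to remove itself rather than silently filling the root volume.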

That sounds excessively lenient. A failure to mount a filesystem should be a fatal error, you shouldn't be letting the rest of the system start up if it encounters a mount failure.
