Cluster locks up if master node filesystem becomes read-only


(Paul Maseberg) #1

I have a three-node cluster running Elasticsearch 2.1.0. The master node is also a data node. The machine that was master had hardware issues and its filesystem became read-only. The whole cluster locked up until I killed the Elasticsearch process on the bad machine.

Is there a setting or mechanism that can make the cluster drop a node that is still running Elasticsearch but is having issues like a read-only filesystem?


(Mark Walkom) #2

No, there's not.


(Jörg Prante) #3

There are two solutions.

The first is ES-only. The reason ES does not shut itself down is that org.elasticsearch.env.NodeEnvironment keeps a java.nio.file.FileStore whose isReadOnly() method is never polled.

The java.nio.file.FileStore of every writable path would have to be monitored so the node could perform an emergency stop in that kind of event.

To make the cluster drop a node with a read-only file store, you would have to modify the code and submit a patch.
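A minimal sketch of what such a watcher could look like, outside of ES itself. This is not ES code, just an illustration of the isReadOnly() polling idea; the class and method names are my own, and note that a fresh FileStore must be fetched on each poll so a remount to read-only is picked up:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of a read-only watcher: poll each data path's FileStore and
// report whether any of them has gone read-only (e.g. after a kernel
// remount following disk errors).
public class ReadOnlyWatcher {

    // Fetch the FileStore freshly each call so a remount is detected.
    public static boolean anyReadOnly(Path... dataPaths) throws IOException {
        for (Path p : dataPaths) {
            FileStore store = Files.getFileStore(p);
            if (store.isReadOnly()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path dataPath = Paths.get(args.length > 0 ? args[0] : ".");
        if (anyReadOnly(dataPath)) {
            // In a real node this would run on a scheduled thread
            // and trigger an emergency shutdown instead of printing.
            System.out.println("read-only: emergency stop");
        } else {
            System.out.println("writable");
        }
    }
}
```

One caveat: isReadOnly() reflects the mount flags, so a disk that silently fails writes without being remounted would need a stronger probe, such as periodically creating and deleting a small temp file on each path.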

The second solution: JVM-based methods are not really sufficient for detecting general hardware malfunctions, so the OS, not ES, is the right place to implement this. Set up server monitoring software that understands SNMP or IPMI, or triggers for mcelog https://github.com/andikleen/mcelog, and have it kill the ES (and other) processes when severe events occur.


(system) #4