Elasticsearch data node does not fail over when data disk fails

In my production environment we are running Elasticsearch 2.3.1 with 1 dedicated master node and 2 data nodes (let's call them dnode1 and dnode2), with 1 shard per index and 1 replica, all running on separate servers.

I had a situation where the data drive on one of the data nodes (dnode2) went down. Elasticsearch on dnode2 was still running, but the instance could no longer access the data drive where the data directory resided. The cluster still thought dnode2 was available, so the master node kept trying to send new documents to dnode2. Some time later, after we discovered what had happened, we manually stopped the Elasticsearch instance on dnode2 and Elasticsearch then failed over to dnode1. Between the failure and the moment we stopped Elasticsearch on dnode2, no new documents could be indexed. Additionally, many of the shards didn't get reassigned to the other node (dnode1) and remained in the UNASSIGNED state. I had to manually reroute those shards to assign them to the working dnode1, roughly as in the sketch below.
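For reference, this is approximately what the manual cleanup looked like, as a minimal sketch using the official Python client (elasticsearch-py) against our 2.3.1 cluster; the host, index, and node names are placeholders for our actual setup:

```python
from elasticsearch import Elasticsearch

# Point the client at the surviving data node (placeholder host name).
es = Elasticsearch(["http://dnode1:9200"])

# List every shard with its index, shard number, primary/replica flag and state.
shards = es.cat.shards(h="index,shard,prirep,state")

for line in shards.splitlines():
    index, shard, prirep, state = line.split()[:4]
    if state != "UNASSIGNED":
        continue
    # Ask the master to allocate the stuck shard copy on the healthy node.
    # (On 2.x, a primary whose copies are all gone additionally needs
    # allow_primary, which is destructive -- see further down the thread.)
    es.cluster.reroute(body={
        "commands": [{
            "allocate": {
                "index": index,
                "shard": int(shard),
                "node": "dnode1"
            }
        }]
    })
```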

I believe our data disks are set up in a RAID 5 configuration, but we have had some unexpected problems with it, so I want to know whether anyone has come across this issue and what I can do to enable Elasticsearch to handle such scenarios.

We've done some work around this in 5.0.

But to be honest there are probably a lot more things we could do.

A couple of corrections:

The master node is responsible for unassigning the shard copies on dnode2, and for demoting dnode2's copy if it is the primary for that shard, but it isn't actually sending documents. That is done by whichever node you send the indexing (or bulk) request to, using its copy of the cluster state, which it gets from the master.
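To make that concrete, here is a hypothetical sketch with the Python client and placeholder host names: the client can send an indexing request to any node, and whichever node receives it coordinates the write and forwards the document to the primary and replica copies according to its cluster state.

```python
from elasticsearch import Elasticsearch

# Whichever of these nodes receives the request coordinates it and routes
# the document to the primary and replica shard copies using its own copy
# of the cluster state; the master is not the one sending documents.
es = Elasticsearch(["http://dnode1:9200", "http://dnode2:9200"])

es.index(index="myindex", doc_type="doc", id=1, body={"field": "value"})
```

Pointing clients at more than one node only removes the receiving node as a single point of failure; it doesn't help once a shard copy is silently broken, as in this incident.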

This is curious, but it could have happened for a couple of reasons. If there isn't enough disk space on dnode1, then Elasticsearch will prevent assigning the shards there. If there isn't a copy available to replicate from, then Elasticsearch will wait for a shard copy to show up, or until you issue a reroute with allow_primary. allow_primary is a destructive operation that loses all data in the shard. It is the right thing to do manually if you know you've lost all copies of the shard, but I don't know that it is ever the right thing to do automatically; Elasticsearch isn't likely to know whether the node will ever come back.
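For completeness, this is roughly what that destructive reroute looks like on 2.x via the Python client (index, shard, and node names are hypothetical); only run it once you're sure no copy of the shard is coming back:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://dnode1:9200"])

# WARNING: allow_primary forces an empty primary to be allocated on dnode1,
# throwing away whatever data the lost copies held.
es.cluster.reroute(body={
    "commands": [{
        "allocate": {
            "index": "myindex",
            "shard": 0,
            "node": "dnode1",
            "allow_primary": True
        }
    }]
})
```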

Even with RAID 5, something is going to go wrong eventually; it should just be less likely. Elasticsearch really should die if the disk is totally busted, but you should also be monitoring the RAID array's health so you can do things like swap out failed disks and let it rebuild online. For now, that monitoring you should be doing anyway is the only workaround I have for Elasticsearch not failing when the disk goes sideways. It isn't a good workaround, but it is something.
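If you can't get at the RAID controller itself, one crude approximation (a sketch, not a substitute for real disk monitoring; host names are placeholders) is to poll Elasticsearch's own cluster health and filesystem stats from outside and alert when something looks off:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://master1:9200"])

# Alert if the cluster isn't green.
health = es.cluster.health()
if health["status"] != "green":
    print("cluster status is", health["status"])

# Alert if a node reports no usable filesystem stats, which is one way a
# dead or inaccessible data disk can show up.
fs_stats = es.nodes.stats(metric="fs")
for node_id, node in fs_stats["nodes"].items():
    available = node.get("fs", {}).get("total", {}).get("available_in_bytes", 0)
    if available <= 0:
        print("node", node.get("name", node_id), "reports no available disk space")
```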

Thanks for your feedback. It's unfortunate that there isn't currently a better way to handle disk failure conditions. We don't have complete control over the servers we run Elasticsearch on, so monitoring RAID array health may not be something we can easily do.

With regard to your response about UNASSIGNED shards, I believe what was happening was that new indices created between the time the dnode2 data disk went south and the time we manually stopped it had primary shards that never got assigned, which I thought was odd. I suppose that since the master node didn't see dnode2 as being down, it may have been trying to assign the primary shard for a new index to it, but the request never completed, so the shard never got properly started.