Node failure and recovery


I have an Elasticsearch 1.4.4 cluster that is currently under-provisioned. When a node falls off the cluster, the remaining nodes start reassigning the failed node's shards among themselves, which drives up heap and disk usage on those nodes. I have gateway.recover_after_nodes set, so I thought the cluster would wait until the node had rejoined and then reinitialize the shards on the node that had failed, but that doesn't seem to be the case (i.e. the behavior you get when you disable shard allocation and restart a node).

Is there a setting that would wait and then reinitialize to the node that had fallen off the cluster and then rejoined?

I have these set:
gateway.recover_after_nodes: 8
gateway.expected_nodes: 8

I am mostly just curious whether I'm using these settings wrong or whether there is a bug in this version. I am working on upgrading the cluster to version 5.x, but it won't be immediate, so this would be triage in the meantime.


What you're looking for is delayed shard allocation (see the 5.x and 1.7 versions of the docs). That functionality was added in 1.7.0.
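For illustration, here is a rough sketch of how the delayed allocation timeout could be set through the index settings API once the cluster is on 1.7 or later. The setting name index.unassigned.node_left.delayed_timeout comes from the delayed allocation docs; the helper name, the localhost URL, and the "5m" value are just assumptions for the example, and none of this applies to 1.4.4.

```python
import json
import urllib.request


def build_delayed_allocation_request(base_url, timeout="5m", index="_all"):
    """Build (but do not send) a PUT request that sets the delayed
    allocation timeout on the given index pattern.

    Hypothetical helper: the name, defaults, and base_url are illustrative.
    Requires Elasticsearch 1.7.0 or later.
    """
    body = {
        "settings": {
            # How long to wait for a departed node to rejoin before its
            # replica shards are reallocated to other nodes.
            "index.unassigned.node_left.delayed_timeout": timeout
        }
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_delayed_allocation(base_url, timeout="5m", index="_all"):
    """Send the request to a live cluster and return the raw response body."""
    req = build_delayed_allocation_request(base_url, timeout, index)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


# Inspect the request without contacting a cluster (assumed local address).
req = build_delayed_allocation_request("http://localhost:9200")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling apply_delayed_allocation("http://localhost:9200") would actually send it. The timeout only delays reallocation of replicas after a node leaves, so pick a value longer than a typical node restart takes in your cluster.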

I'd recommend at least upgrading to 1.7 for a variety of reasons, not least that 1.4 has multiple security vulnerabilities associated with it. But really, your 5.x plan is much better.

Ah yes, I remember that's what it is called now. I definitely have clusters across all the major versions so it's easy for me to mix the settings up.

