5 node cluster breaks when master is shut down

The elasticsearch.yml on the master is:

cluster.name: elasticsearchlogstashkibana
node.name: "elksrv1"
node.master: true
node.data: true
index.number_of_replicas: 2
index.number_of_shards: 5
indices.recovery.compress: false

The 4 other nodes are configured as above, but with:

cluster.name: elasticsearchlogstashkibana
node.name: "elksrv[2-5]"
node.master: false
node.data: true
index.number_of_replicas: 2
index.number_of_shards: 5
indices.recovery.compress: false

What are the settings to permit resiliency of master so that when it crashes or service is taken down the cluster survives?

Don't have a single dedicated master. For a five-node cluster it's most likely unnecessary and wasteful to have a dedicated master node, especially since a single one becomes a single point of failure, so just make all the data nodes master-eligible and drop the current dedicated master. If you insist on dedicated masters, you need three of them.

So the other nodes should have node.master: true instead of the present node.master: false? I.e. all nodes have node.master: true?

Oh, and make sure you set discovery.zen.minimum_master_nodes to a majority of the master-eligible nodes (N/2 + 1, rounded down), i.e. 2 for three-node clusters and 3 for four- or five-node clusters.
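If you don't want to restart every node just to change that value, discovery.zen.minimum_master_nodes can also be set dynamically through the cluster settings API - a minimal sketch, assuming ES is reachable on localhost:9200:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 3 }
}'
# "persistent" survives full cluster restarts; use "transient" if it should only last until the next restart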

So I am hearing this:

cluster.name: elasticsearchlogstashkibana
node.name: "elksrv[1-5]"
node.master: true
node.data: true
index.number_of_replicas: 2
index.number_of_shards: 5
indices.recovery.compress: false
discovery.zen.minimum_master_nodes: 3

for all.

Yes. And you'll probably want to have a five node cluster since a four node cluster won't be able to survive two nodes being down (which I guess is the point of having two replicas?). If your current dedicated master isn't powerful enough to be a data node you can keep it as a pure master node, but that doesn't mean it'll actually be elected master.
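To spell out the quorum arithmetic behind that, with minimum_master_nodes set to 3 and all nodes master-eligible:

# 5 nodes, 2 down -> 3 master-eligible nodes remain, 3 >= 3, a master can still be elected
# 4 nodes, 2 down -> 2 master-eligible nodes remain, 2 < 3, no quorum, so no master can be elected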

Should discovery.zen.ping.unicast.hosts: have the list of all the nodes, nothing, or something else?

You don't have to list all the nodes, but you need to list enough nodes so that the cluster will be able to form even if some of the nodes are down. In your case you'll want to list at least three nodes, since two can be out of service. However, since you should be managing your config files with a configuration management tool that can generate files from templates, it might be just as easy to list all the nodes.

Okay - this seems to work, mostly, for each of the 5 nodes' elasticsearch.yml
(but see the concern below the config):
cluster.name: elasticsearchlogstashkibana
node.name: "elksrv5"
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 2
indices.recovery.compress: false
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["elksrv1.channel-corp.com","elksrv2.channel-corp.com","elksrv3.channel-corp.com","elksrv4.channel-corp.com","elksrv5.channel-corp.com"]
script.engine.groovy.inline.update: on

When I test this by taking the non-masters down, no problem. We go yellow, then green after it automatically reassigns a moderate number of shards. The process takes perhaps a minute or two.
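(For reference, the status transitions can be watched with the cluster health API - e.g., assuming the default HTTP port:)

curl -s 'http://localhost:9200/_cluster/health?pretty'
# "status" drops to yellow while replicas are reassigned, then returns to green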

When I test this by taking the master down, there is a problem. We go to red and stay red with an unassigned shard count (usually small - tested on two nodes). When I turn the now non-master back on (election of a new master does take place), the unassigned shard count goes back to zero after a minute or so and the cluster goes from red to yellow to green.

But the cluster does not recover to yellow or green from a down master, unless I am missing something.

So you're saying that shards stay unassigned even though a new master is elected and there is at least one replica of the shards in question? That's unexpected. Are there any clues in the ES logs?
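To see which shards are stuck and where their surviving copies live, the _cat APIs are useful alongside the logs - a quick sketch, assuming the default port:

# list all shards and filter for the ones that have no node
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED
# per-index health shows which indices are keeping the cluster red
curl -s 'http://localhost:9200/_cluster/health?level=indices&pretty'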

Which version of Elasticsearch are you using? Are all nodes in the cluster the same version?
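(A quick way to check, assuming the default port - each node's version should show up in the _cat/nodes output:)

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,version'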

One node is running 1.2.2.

The others are running 1.6.0.

Is this a problem?

How do I upgrade from 1.2.2 to 1.6.0?

Thanks in advance.

Yes, that is a problem. Different versions use different Lucene versions, so once a shard has been upgraded on one of the newer instances it can no longer be reallocated to the older node. There may also be other issues depending on which versions are in use, so all nodes in a cluster should always run the same version.

Instructions for upgrading can be found in the official Elasticsearch upgrade documentation.
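In rough outline - a sketch only, double-check the documentation for the exact steps between these particular versions - a rolling upgrade of the 1.2.2 node looks like this:

# 1. disable shard allocation so the cluster doesn't start shuffling shards while the node is down
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
# 2. stop the 1.2.2 node, install the 1.6.0 package, and start the node again
# 3. once it has rejoined the cluster, re-enable allocation and wait for green
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'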