Unassigned Shards After Node Restart

We have a 3-node ES cluster (2.3.3) with 4 cores and 16 GB RAM. The config on each node includes:
gateway.expected_nodes: "3"
gateway.recover_after_nodes: "2"
5 shards per index

Last week while doing maintenance I did the following (this is a development cluster):

  1. Set cluster settings: "cluster.routing.allocation.enable": "none"
  2. Rebooted 2 nodes
  3. Set cluster settings: "cluster.routing.allocation.enable": "all"
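For reference, the steps above were run roughly like this (hedged sketch: assuming curl against one of the nodes on localhost:9200, which is a placeholder for whichever node you hit; I used the transient scope, which does not survive a full-cluster restart):

```shell
# Step 1: disable shard allocation before the maintenance window.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Step 2: reboot the nodes...

# Step 3: re-enable allocation once the nodes are back.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```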

What I expected was that the remaining node would pause while the two nodes were down, and that once the nodes had restarted the cluster would come back up and recover.

What in fact happened is that the cluster came back up (showing 3 nodes) and for a while reported a positive count of initializing shards. Eventually the initializing count dropped to zero, but only 66 percent of shards were active; _cat/shards showed many shards as "unassigned", and the cluster state was red.
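In case it helps with diagnosis, this is roughly how I checked (host is a placeholder; the unassigned.reason column is part of the 2.x _cat API):

```shell
# Cluster-level view: counts of active / initializing / unassigned shards.
curl 'http://localhost:9200/_cluster/health?pretty'

# Per-shard view, including why each unassigned shard is unassigned
# (e.g. NODE_LEFT, CLUSTER_RECOVERED, ALLOCATION_FAILED).
curl 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason&v'
```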

I suspect what I did may not have been a good idea; I should have taken one node down at a time. But I don't understand why the rebooted nodes would not have been able to restart the shards they owned. It's a cause for concern because, for logistical reasons, we have to run two of our nodes in a single data center.

Can anyone shed some light on what could have happened?


You'd need to dig into your logs.

But, did you have minimum masters set?

Yes, I should have mentioned that minimum_master_nodes was set to 2.
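For completeness, that's the zen discovery setting in elasticsearch.yml on each node (2.x syntax):

```yaml
# With 3 master-eligible nodes, the quorum is (3 / 2) + 1 = 2.
discovery.zen.minimum_master_nodes: 2
```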

The logs show nothing that I would regard as unusual in the circumstances...

On the remaining node:
  - "not enough master nodes", then eventually the other two nodes join again

On the rebooted nodes:
  - Usual startup logs
  - A couple of exceptions because they can't find some Groovy scripts (I don't imagine that would matter)
  - Detection of the remaining node as master
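If the cluster is still red, one thing worth trying (a hedged suggestion, not a fix: this only nudges the allocator, it won't force-assign anything) is an empty reroute, which asks the master to re-run allocation; _cat/recovery then shows whether anything is actually moving. Host is a placeholder:

```shell
# Ask the master to re-run shard allocation (no commands = just reroute).
curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty'

# Watch per-shard recoveries to see whether anything is progressing.
curl 'http://localhost:9200/_cat/recovery?v'
```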