We have a 3-node ES cluster (2.3.3) with 4 cores and 16 GB RAM. The config on each node includes:
gateway.expected_nodes: "3"
gateway.recover_after_nodes: "2"
5 shards per index
Last week while doing maintenance I did the following (this is a development cluster):
- Set cluster settings: "cluster.routing.allocation.enable": "none"
- Rebooted 2 nodes
- Set cluster settings: "cluster.routing.allocation.enable": "all"
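
For reference, the two settings changes above can be sketched in Python roughly as follows. This is only an illustration: the address `http://localhost:9200`, the helper names, and the use of `transient` (rather than `persistent`) settings are assumptions, not details of what was actually run on the cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address, adjust for your cluster

# Values accepted by cluster.routing.allocation.enable in ES 2.x
VALID_MODES = ("all", "primaries", "new_primaries", "none")


def allocation_settings_body(mode):
    """Build the JSON body for PUT /_cluster/settings (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError("unknown allocation mode: %r" % (mode,))
    # "transient" is an assumption; "persistent" is the other option.
    return json.dumps(
        {"transient": {"cluster.routing.allocation.enable": mode}}
    )


def put_allocation(mode, base_url=ES_URL):
    """Send the setting to the cluster and return the parsed JSON response."""
    req = urllib.request.Request(
        base_url + "/_cluster/settings",
        data=allocation_settings_body(mode).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `put_allocation("none")` before the reboot and `put_allocation("all")` afterwards would correspond to the first and last steps.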
I expected the remaining node to pause while the two nodes were down and, once they had restarted, the cluster to come back up and recover.
What actually happened is that the cluster came back up (reporting 3 nodes) and for a while showed a positive count of initializing shards. Eventually the count of initializing shards dropped to zero, but only 66 percent of shards were active. _cat/shards showed many shards as "unassigned", and the cluster state is red.
I suspect what I did was not a good idea and that I should have taken one node down at a time. But I don't understand why the rebooted nodes were not able to restart the shards they owned. It's a cause for concern because, for logistical reasons, we have to run two of our nodes in a single data center.
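
To summarize what _cat/shards reports, a small script along these lines can tally shards by state. It is a sketch that assumes the default column order (index, shard, prirep, state, ...); the function name is hypothetical and the sample rows are made up for illustration, not taken from our cluster.

```python
from collections import Counter


def count_shard_states(cat_shards_text):
    """Count shards per state from the plain-text output of _cat/shards."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[0] == "index" and fields[3] == "state":
            continue  # skip the header row printed when ?v is used
        counts[fields[3]] += 1  # fourth column holds the shard state
    return counts


# Made-up sample rows; unassigned shards have no docs/store/ip/node columns.
SAMPLE = """\
logs-a 0 p STARTED 10 5kb 10.0.0.1 node-1
logs-a 0 r UNASSIGNED
logs-a 1 p STARTED 12 6kb 10.0.0.2 node-2
"""

if __name__ == "__main__":
    for state, n in sorted(count_shard_states(SAMPLE).items()):
        print(state, n)
```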
Can anyone shed some light on what could have happened?
Thanks