I have a 5-node cluster (node1, node2, node3, node4 and node5). All nodes are master-eligible and all are data nodes, and discovery.zen.minimum_master_nodes is set to 3. Everything else is default.
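For reference, this is roughly the relevant part of my elasticsearch.yml (same on every node; only node.name differs):

```yaml
# elasticsearch.yml (identical on all 5 nodes except node.name)
node.name: node1
node.master: true          # default: every node is master-eligible
node.data: true            # default: every node holds data

# quorum of master-eligible nodes: 3 of 5
discovery.zen.minimum_master_nodes: 3
```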
To simulate a network problem between node5 and the rest of the cluster, I used the firewall to block all traffic from node5 to the other nodes and all traffic from the other nodes to node5.
I expected node5 to be removed from the cluster, but instead I cannot do any POST/GET against ANY node in the cluster. On the master, node3, I can see timeouts connecting to node5.
Querying any node, including the isolated node5, e.g. node1:9200/_cluster/health, returns green and reports 5 nodes. Only after about 30 minutes do the nodes finally realise node5 is not responding and remove it from the cluster.
So the whole cluster was down for over 30 minutes because one node was isolated. On node3, which is the master at this time, I can see the following repeated every minute:
[node3] failed to execute on node [VdSa2w0tSwiUyNNFhzvNXg]
org.elasticsearch.transport.NodeDisconnectedException: [node5][172.16.99.234:9300][cluster:monitor/nodes/stats[n]] disconnected
Why is the cluster unusable when a single node is isolated? And what can I do to speed up detection and recovery?
Here is the discovery config, which is identical on all 5 nodes:
discovery.zen.ping.unicast.hosts: ["172.16.99.230", "172.16.99.231", "172.16.99.232", "172.16.99.233", "172.16.99.234"]
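From what I understand of the zen discovery fault-detection settings, these are the knobs that control how quickly an unresponsive node is dropped (the values shown are, as far as I know, the defaults; I have not changed them). A sketch of what I assume I could tune to make removal faster:

```yaml
# Zen fault-detection settings (assumed defaults shown).
# Master pings each node (and nodes ping the master) every ping_interval;
# a ping fails after ping_timeout, and the node is dropped after
# ping_retries consecutive failures.
discovery.zen.fd.ping_interval: 1s    # default
discovery.zen.fd.ping_timeout: 30s    # default; lowering this should speed up detection
discovery.zen.fd.ping_retries: 3      # default; fewer retries = faster removal
```

Is tuning these the right approach, or am I misunderstanding why the cluster stays unusable for so long?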