Any way to exclude a non-responding node from a running ES cluster?

We have the following setup :

  • 4 data nodes, 2 instances per node (128 GB / 24 CPUs per machine, each instance has 24 GB)
  • 3 master nodes (node.data: false, node.master: true)
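
Just to make the layout clearer, here is a rough sketch of what the node-role settings look like in elasticsearch.yml for this kind of topology (illustrative only, not copied from our actual config files):

    # elasticsearch.yml on the 3 dedicated master nodes
    node.master: true     # eligible to be elected master
    node.data: false      # holds no shard data

    # elasticsearch.yml on each of the 8 data instances (2 per machine)
    node.master: false
    node.data: true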

Today one of our nodes got stuck for a while: its CPUs stopped responding, we got a partial NMI backtrace, rebooted the node, etc.

The issue is that between the time the CPUs started to deadlock and the time the node was rebooted, it was still part of the ES cluster (both instances on it were still members), but every request sent to it failed. The logs of my other nodes are full of:

[2015-06-13 11:34:42,777][DEBUG][action.admin.cluster.node.stats] [master2] failed to execute on node [87O3JOPEThG6iB-xPS16CA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [data2-2][inet[/10.21.0.10:9302]][cluster:monitor/nodes/stats[n]] request_id [652821834] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Only when I rebooted the server were the 2 instances removed from the cluster:

[2015-06-13 11:51:30,073][INFO ][cluster.service          ] [master2] removed {[data2][WlA17Qx5T4G4w475oB2y0Q][mongo2.melty][inet[/10.21.0.10:9300]]{host=data2, master=false},}, reason: zen-disco-node_failed([data2][WlA17Qx5T4G4w475oB2y0Q][data2][inet[/10.21.0.10:9300]]{host=data2, master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2015-06-13 11:51:31,856][INFO ][cluster.service          ] [master2] removed {[data2-2][87O3JOPEThG6iB-xPS16CA][data2][inet[/10.21.0.10:9302]]{host=data2, master=false},}, reason: zen-disco-node_failed([data2-2][87O3JOPEThG6iB-xPS16CA][data2][inet[/10.21.0.10:9302]]{host=data2, master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout

So for roughly 15 minutes, none of the requests sent to the 2 instances got a reply, yet both instances were kept in the cluster, so more and more requests were routed to them (causing errors on the client side).

During this time, there were no log entries at all on the failed node.

So my question is: is there any way to consider a node down if it fails to answer requests for some period of time?

Maxence

If the node was unresponsive, it should have timed out of zen discovery and the master should have removed it from the cluster.
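
For reference, the timeouts that govern this are the zen fault-detection settings in elasticsearch.yml; the values below are the 1.x defaults, shown only as an illustration of which knobs exist, not as a recommendation:

    # elasticsearch.yml -- zen discovery fault detection (1.x defaults)
    discovery.zen.fd.ping_interval: 1s   # how often the master pings each node
    discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping reply
    discovery.zen.fd.ping_retries: 3     # failed pings before the node is dropped

That matches the removal log above ("tried [3] times, each with maximum [30s] timeout"). Lowering ping_timeout or ping_retries would make the master drop an unresponsive node sooner, at the cost of more false removals during long GC pauses or network hiccups.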

What version are you on? What monitoring do you have on the ES nodes?

The cluster is running 1.5.2. I'm not sure what you want to know about monitoring; we have fairly basic monitoring outside of the cluster (checking cluster state, checking shard allocation, things like that).

I can't see any log related to the discovery process during that timeframe, which makes me think the node was somehow still alive but didn't answer any requests correctly (since its CPUs were stuck).