We have the following setup:
- 4 data nodes, 2 instances per node (128G RAM / 24 CPUs per node, each instance has a 24G heap)
- 3 master nodes (data: false, master: true)
Today one of our nodes got stuck for some time: the CPUs were not responding, we got a partial NMI backtrace, rebooted the node, etc.
The issue is that between the time the CPUs started to deadlock and the time the node was rebooted, it was still part of the ES cluster (both instances on it were still listed as members), but all requests sent to it failed. The logs of my other nodes are full of:
[2015-06-13 11:34:42,777][DEBUG][action.admin.cluster.node.stats] [master2] failed to execute on node [87O3JOPEThG6iB-xPS16CA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [data2-2][inet[/10.21.0.10:9302]][cluster:monitor/nodes/stats[n]] request_id [652821834] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Only when I rebooted the server were the 2 instances removed from the cluster:
[2015-06-13 11:51:30,073][INFO ][cluster.service ] [master2] removed {[data2][WlA17Qx5T4G4w475oB2y0Q][mongo2.melty][inet[/10.21.0.10:9300]]{host=data2, master=false},}, reason: zen-disco-node_failed([data2][WlA17Qx5T4G4w475oB2y0Q][data2][inet[/10.21.0.10:9300]]{host=data2, master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2015-06-13 11:51:31,856][INFO ][cluster.service ] [master2] removed {[data2-2][87O3JOPEThG6iB-xPS16CA][data2][inet[/10.21.0.10:9302]]{host=data2, master=false},}, reason: zen-disco-node_failed([data2-2][87O3JOPEThG6iB-xPS16CA][data2][inet[/10.21.0.10:9302]]{host=data2, master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
So for roughly 15 minutes, none of the requests sent to the 2 instances got a reply, yet both instances were kept in the cluster, so more and more requests kept being routed to them (causing errors on the client side).
During this time, there was no log output at all on the failed node.
So my question is: is there any way to consider a node down if it fails to answer requests for some amount of time?
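Looking at the removal message above ("failed to ping, tried [3] times, each with maximum [30s] timeout"), I assume the zen discovery fault-detection defaults were in play. If tuning those is the right way to go, would something like the following in elasticsearch.yml be reasonable? (The values below are just untested guesses on my part.)

# Hypothetical tuning of zen fault detection (untested, values are guesses):
# ping each node every second, wait at most 10s per ping,
# and drop the node after 2 failed pings instead of 3 x 30s
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 2

I'm not sure whether lowering these is safe under heavy GC pauses, or whether there is a better mechanism that looks at actual request failures rather than pings.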
Maxence