We have the following setup:
- 4 data nodes, 2 instances per node (128G RAM / 24 CPUs per node, each instance has a 24G heap)
- 3 master nodes (data: false, master: true)
Today one of our nodes got stuck for some time: the CPUs were not responding, we got a partial NMI backtrace, rebooted the node, etc.
The issue is that between the time the CPUs started to deadlock and the time the node was rebooted, it was still part of the ES cluster (both instances on it were still listed as members), but all requests sent to it failed. The logs of my other nodes are full of:
[2015-06-13 11:34:42,777][DEBUG][action.admin.cluster.node.stats] [master2] failed to execute on node [87O3JOPEThG6iB-xPS16CA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [data2-2][inet[/10.21.0.10:9302]][cluster:monitor/nodes/stats[n]] request_id [652821834] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Only when I rebooted the server were the 2 instances removed from the cluster:
[2015-06-13 11:51:30,073][INFO ][cluster.service ] [master2] removed {[data2][WlA17Qx5T4G4w475oB2y0Q][mongo2.melty][inet[/10.21.0.10:9300]]{host=data2, master=false},}, reason: zen-disco-node_failed([data2][WlA17Qx5T4G4w475oB2y0Q][data2][inet[/10.21.0.10:9300]]{host=data2, master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2015-06-13 11:51:31,856][INFO ][cluster.service ] [master2] removed {[data2-2][87O3JOPEThG6iB-xPS16CA][data2][inet[/10.21.0.10:9302]]{host=data2, master=false},}, reason: zen-disco-node_failed([data2-2][87O3JOPEThG6iB-xPS16CA][data2][inet[/10.21.0.10:9302]]{host=data2, master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
So for roughly 15 minutes, none of the requests sent to the 2 instances got a reply, yet both instances were kept in the cluster, so more and more requests kept being routed to them (causing errors on the client side).
During this time, there was no log output at all on the failed node.
So my question is: is there any way to consider a node down if it fails to answer requests for some amount of time?
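Looking at the removal message above ("failed to ping, tried [3] times, each with maximum [30s] timeout"), I assume the zen discovery fault-detection defaults were in play. If tuning those is the right way to go, would something like the following in elasticsearch.yml be reasonable? (The values below are just untested guesses on my part.)

# Hypothetical tuning of zen fault detection (untested, values are guesses):
# ping each node every second, wait at most 10s per ping,
# and drop the node after 2 failed pings instead of 3 x 30s
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 2

I'm not sure whether lowering these is safe under heavy GC pauses, or whether there is a better mechanism that looks at actual request failures rather than pings.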
Maxence