Long period of querying failure during node timeout

I've got a large 5.6.3 cluster with a couple of high-query-rate (low-update-rate) indices spread evenly across the nodes. We had a node freeze up (its VM was paused during an unplanned hypervisor restart), which correlated with a large error spike in our querying. I guess that's to be expected. But:

All the data nodes' CPU utilizations dropped during the period the node was timing out, about 1.5 minutes. The error spike seemed much larger than the number of in-flight queries you'd expect to have dropped from the loss of one node. Cluster-wide the queries dropped to 0, again for about 1.5 minutes.

I get the impression that the cluster was basically put on pause while the node timed out. If that's plausible, I'd guess it's because the cluster couldn't update its state while trying to reach a node it thought should still be there, and that this somehow blocked the broadcast of new queries. (Or... was every query still trying to hit the absent node?) But does ES really behave this way? I don't remember having seen this in the past.

Does anyone have an opinion on the pros/cons of setting a faster node transport ping timeout? I figure if the cluster is paused while waiting for node timeouts, and if timeouts take 1.5 minutes (3 ping rounds of 30s each?), and if nodes generally ping very reliably, then shortening the ping timeout to, say, 5 seconds could prevent long outages from single node failures without too frequently marking nodes as timed out.
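For reference, on 5.x/6.x the timing being described here is controlled by the zen fault-detection settings in `elasticsearch.yml`. A sketch of what shortening the timeout might look like (the 5s value mirrors the suggestion above and is illustrative, not a recommendation):

```yaml
# Zen fault detection. Defaults on 5.x/6.x: ping_interval 1s,
# ping_timeout 30s, ping_retries 3 -- i.e. roughly the
# 3 x 30s ~ 1.5 minutes observed above before a node is dropped.
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 5s   # shortened from the 30s default
discovery.zen.fd.ping_retries: 3
```

Note the trade-off discussed below: a node that pauses for a long GC could now be dropped after ~15 seconds instead of ~1.5 minutes.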

I believe this has been changed/improved quite significantly in Elasticsearch 7.x, so would recommend you upgrade.

Excellent. That sounds like a good idea. We've already migrated a couple clusters to 7.6.1. Glad to know they're better off.

Any idea if I could do something for the remaining clusters while we wait to get them moved over?

I think the solution is to set tcp_retries2 to a lower value rather than adjusting the ping timeouts in Elasticsearch. Shorter pings in Elasticsearch will result in instability if nodes are under GC pressure, whereas tcp_retries2 lets Elasticsearch detect that a node has completely vanished from the network much more quickly without being affected by GC pauses. The default value of tcp_retries2 is 15, which equates to a timeout of well over ten minutes, which is just ridiculously long. If you set it to 3 as that article recommends, then a node that vanishes will be detected within a second or so.
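On Linux this setting lives under `net.ipv4.tcp_retries2` and can be inspected and changed with `sysctl`. A minimal sketch (the value 3 follows the suggestion above; changing it requires root, so the write commands are shown in comments):

```shell
# Show the current TCP retransmission retry count.
# The Linux default is 15, which corresponds to a total
# retransmission timeout well over ten minutes.
cat /proc/sys/net/ipv4/tcp_retries2

# Lower it at runtime (requires root):
#   sysctl -w net.ipv4.tcp_retries2=3
#
# To persist across reboots, add the following line to
# /etc/sysctl.conf (or a file under /etc/sysctl.d/):
#   net.ipv4.tcp_retries2 = 3
```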

