Long period of querying failure during node timeout

I've got a large 5.6.3 cluster with a couple of high-query-rate (low-update-rate) indices spread evenly across the nodes. We had a node freeze up (its VM was paused during an unplanned hypervisor restart), which correlated with a large error spike in our querying. I guess that's to be expected. But:

All the data nodes' CPU utilizations dropped during the period the node was timing out, about 1.5 minutes. The error spike seemed much larger than the number of in-flight queries you'd expect to have dropped from the loss of one node. Cluster-wide the queries dropped to 0, again for about 1.5 minutes.

I get the impression that the cluster was basically put on pause while the node timed out. If that's plausible, I'd guess it's because the cluster couldn't update its state while trying to reach a node it thought should still be there, and that this somehow blocked the broadcast of new queries. (Or... was every query still trying to hit the absent node?) But does ES really behave this way? I don't remember having seen this in the past.

Does anyone have an opinion on the pros/cons of setting a faster node transport ping timeout? I figure if the cluster is paused while waiting for node timeouts, and if timeouts take 1.5 minutes (3 ping rounds of 30s each?), and if nodes generally ping very reliably, then shortening the ping timeout to, say, 5 seconds could prevent long outages from single node failures without too frequently marking nodes as timed out.
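For reference, on 5.x/6.x the timing being described here is controlled by the zen fault-detection settings in `elasticsearch.yml`. A sketch of what shortening the timeout might look like (the 5s value mirrors the suggestion above and is illustrative, not a recommendation):

```yaml
# Zen fault detection. Defaults on 5.x/6.x: ping_interval 1s,
# ping_timeout 30s, ping_retries 3 -- i.e. roughly the
# 3 x 30s ~ 1.5 minutes observed above before a node is dropped.
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 5s   # shortened from the 30s default
discovery.zen.fd.ping_retries: 3
```

Note the trade-off discussed below: a node that pauses for a long GC could now be dropped after ~15 seconds instead of ~1.5 minutes.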

I believe this has been changed/improved quite significantly in Elasticsearch 7.x, so would recommend you upgrade.

Excellent. That sounds like a good idea. We've already migrated a couple clusters to 7.6.1. Glad to know they're better off.

Any idea if I could do something for the remaining clusters while we wait to get them moved over?

I think the solution is to set tcp_retries2 to a lower value rather than adjusting the ping timeouts in Elasticsearch. Shorter pings in Elasticsearch will result in instability if nodes are under GC pressure, whereas tcp_retries2 lets Elasticsearch detect that a node has completely vanished from the network much more quickly without being affected by GC pauses. The default value of tcp_retries2 is 15, which equates to a timeout of well over ten minutes, which is just ridiculously long. If you set it to 3 as that article recommends, then a node that vanishes will be detected within a second or so.
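On Linux this setting lives under `net.ipv4.tcp_retries2` and can be inspected and changed with `sysctl`. A minimal sketch (the value 3 follows the suggestion above; changing it requires root, so the write commands are shown in comments):

```shell
# Show the current TCP retransmission retry count.
# The Linux default is 15, which corresponds to a total
# retransmission timeout well over ten minutes.
cat /proc/sys/net/ipv4/tcp_retries2

# Lower it at runtime (requires root):
#   sysctl -w net.ipv4.tcp_retries2=3
#
# To persist across reboots, add the following line to
# /etc/sysctl.conf (or a file under /etc/sysctl.d/):
#   net.ipv4.tcp_retries2 = 3
```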

