I've got a large 5.6.3 cluster with a couple of high-query-rate (low update rate) indices spread evenly across the nodes. We had a node freeze up (the VM paused during an unplanned hypervisor restart), which correlated with a large error spike in our querying. I guess that's to be expected. But:
All the data nodes' CPU utilization dropped for the period the node was timing out, about 1.5 minutes. The error spike was much larger than the number of in-flight queries you'd expect to have been dropped by the loss of one node. Cluster-wide, queries dropped to 0, again for about 1.5 minutes.
I get the impression that the cluster was basically put on pause while the node timed out. If that's plausible, I suspect it's because the master couldn't update cluster state while trying to reach a node it thought should still be there, and that this somehow blocked fan-out of new queries. (Or... was every query still trying to hit the absent node?) But does ES really behave this way? I don't remember having seen this in the past.
Does anyone have an opinion on the pros/cons of setting a faster node transport ping timeout? I figure if the cluster is paused while waiting for node timeouts, and if timeouts take 1.5 minutes (3 ping rounds of 30s each?), and if nodes generally ping very reliably, then shortening the ping timeout to, say, 5 seconds could prevent long outages from single-node failures without marking nodes as timed out too often.
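For concreteness, here's the kind of change I'm considering, using the Zen fault-detection settings as I understand them in 5.x (where ping_timeout defaults to 30s and ping_retries to 3, which would line up with the ~90s I observed). The 5s value is just my proposal, not a recommendation I've validated:

```yaml
# elasticsearch.yml — Zen fault-detection settings (5.x)
# Defaults, as I understand them:
#   discovery.zen.fd.ping_interval: 1s
#   discovery.zen.fd.ping_timeout: 30s
#   discovery.zen.fd.ping_retries: 3    # worst case ~3 x 30s = 90s to declare a node dead

discovery.zen.fd.ping_timeout: 5s     # proposed: worst case ~3 x 5s = 15s
```

The obvious con I can see is that a transient network hiccup or long GC pause over 5s on any node would now trigger spurious node removals and shard reallocation, so I'd want to weigh that against how reliable our pings actually are.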