Stuck "Cancelled Tasks" in Elasticsearch 8.6.2 causing cluster failure

Hmm, actually, having said that I don't think it's a config issue, Linux does have a silly default which means it can take 900+ seconds between a connection dropping and userspace being notified of the drop. This page of the manual has more details. If this were the cause, you'd see messages about dropped connections in the logs.
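To show where a figure like 900+ seconds can come from, here's a rough sketch of the arithmetic. It assumes the default in question is `net.ipv4.tcp_retries2 = 15` and that the retransmission timeout starts at 200 ms and doubles up to a 120 s cap. Those constants and the function name are my assumptions for illustration, not something stated above, and the real timeout depends on the measured round-trip time, so treat the result as a hypothetical lower bound rather than an exact number.

```python
# Assumed kernel constants (typical Linux values, not taken from this thread).
TCP_RTO_MIN = 0.2    # initial retransmission timeout, in seconds
TCP_RTO_MAX = 120.0  # cap on the retransmission timeout, in seconds


def retransmission_timeout(retries: int = 15) -> float:
    """Hypothetical time before the kernel gives up on an unresponsive peer.

    Sums the wait after the original transmission plus the wait after each
    of `retries` retransmissions, doubling the wait each time up to the cap.
    """
    total = 0.0
    rto = TCP_RTO_MIN
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX)
    return total


if __name__ == "__main__":
    # With the assumed default of 15 retries this prints "924.6 s".
    print(f"{retransmission_timeout(15):.1f} s")
```

On a real node, the value actually in effect can be read from `/proc/sys/net/ipv4/tcp_retries2`.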

You'd need to be a bit more precise about what you mean by "kill our cluster", though. The other recent thread on this topic has the zombie tasks consuming a lot of CPU, but that wouldn't happen if they were just waiting for a dead connection to time out. They'd potentially hold on to a lot of heap, causing GC pressure, but wouldn't themselves consume any CPU.