Hey @DavidTurner - Our team is seeing an eerily similar issue to this on 8.6.2, and it's causing big problems for our cluster with rejected threads and search transport queues.
Would updating to 8.8 potentially fix this?
I see you posted about this exact topic here in January 2022:
opened 11:07AM - 21 Jan 22 UTC
>enhancement
:Distributed/Task Management
Team:Distributed
Cancelling a task on a remote node is an active process: we send a cancellation request to the remote node and continue to wait for it to indicate the completion of the task. Typically the task will complete with a `TaskCancelledException` but it might fail in a different way, or even succeed, if the cancellation loses the race to completion. If the remote node is unable to respond for some reason then today we wait indefinitely, and this means we cannot free the resources held by the listener. If the remote node remains unresponsive for long enough then the build-up of listeners on other nodes can cause a cascading failure.
In contrast, if we specify a timeout on the transport request that triggers the remote task then we complete the listener eagerly at the timeout, although the task is still running remotely. Indeed we don't even attempt to cancel the remote task in this case (https://github.com/elastic/elasticsearch/issues/66992) so it just keeps on running.
I believe we should not wait indefinitely for a cancelled task to complete and should instead unilaterally complete the waiting listener with a `TaskCancelledException` to protect against an unresponsive remote node. Maybe we should do this straight away, similarly to how we handle a timeout, but we could also allow some time for the cancellation to happen gracefully first.
Relates https://github.com/elastic/elasticsearch/issues/82337 which describes a similar problem specifically about stats requests, since these are often the source of actual problems in this area.
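If I'm reading the issue right, the proposal boils down to something like the rough Python sketch below. This is not Elasticsearch's actual code, just my understanding of the idea: wait for the remote node to confirm the cancellation, but only for a bounded grace period, then complete the local listener with a cancellation error so its resources can be freed. `cancel_with_grace`, `TaskCancelledError` and `grace_seconds` are names I made up for illustration.

```python
import threading
from concurrent.futures import Future


class TaskCancelledError(Exception):
    """Hypothetical stand-in for Elasticsearch's TaskCancelledException."""


def cancel_with_grace(remote_response: Future, grace_seconds: float) -> Future:
    """Return a listener future that completes with the remote outcome, or
    with TaskCancelledError if the remote stays silent past the grace period."""
    listener: Future = Future()
    lock = threading.Lock()

    def complete_once(action):
        # Only the first completion wins; a late remote response is ignored.
        with lock:
            if not listener.done():
                action()

    def on_grace_expired():
        # Unilaterally complete the listener instead of waiting forever.
        complete_once(lambda: listener.set_exception(
            TaskCancelledError("remote node did not confirm cancellation in time")))

    def on_remote_done(fut: Future):
        exc = fut.exception()
        if exc is not None:
            complete_once(lambda: listener.set_exception(exc))
        else:
            complete_once(lambda: listener.set_result(fut.result()))
        timer.cancel()  # the remote answered, so the grace timer is no longer needed

    timer = threading.Timer(grace_seconds, on_grace_expired)
    timer.daemon = True
    timer.start()
    remote_response.add_done_callback(on_remote_done)
    return listener
```

Setting `grace_seconds` to zero would correspond to the "do this straight away" option from the issue; a larger value gives the remote cancellation some time to happen gracefully first.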
Then the following PR was merged and mentioned in the 8.8 release notes:
elastic:main
← kingherc:enhancement/90353-66992-cancel-child-on-timeout
opened 01:55PM - 28 Dec 22 UTC
To make this possible we modify the CancellableTasksTracker to track children tasks by the Request ID as well. That way, we can send an Action to cancel a child based on the parent task and the Request ID.
This is especially useful when a parent's child requests time out on the parent's side.
The motivation behind this PR is to fix test failure #90353. In discussing the simple solution of PR https://github.com/elastic/elasticsearch/pull/92520, we decided with @DaveCTurner that the best approach to solving the test failure would be to solve #66992. Unfortunately that issue may require substantial effort. But for the moment, we thought it would be easier to cancel children requests on timeout, since we already have infrastructure for tracking children tasks (through the `CancellableTasksTracker`).
Fixes #90353
Relates #66992
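To check my own understanding of the PR description, here is a minimal Python sketch of tracking child tasks by parent task and request ID so that a single child can be cancelled when its request times out on the parent's side. It is only an illustration: `ChildTaskTracker`, `ChildTask` and their methods are hypothetical and do not reflect the real `CancellableTasksTracker` API.

```python
from collections import defaultdict


class ChildTask:
    """Hypothetical child task that just records why it was cancelled."""

    def __init__(self, name):
        self.name = name
        self.cancel_reason = None

    def cancel(self, reason):
        self.cancel_reason = reason


class ChildTaskTracker:
    """Simplified illustration: parent_id -> {request_id: child task}."""

    def __init__(self):
        self._children = defaultdict(dict)

    def register(self, parent_id, request_id, task):
        self._children[parent_id][request_id] = task

    def unregister(self, parent_id, request_id):
        # Remove and return the child, or None if it is not tracked.
        children = self._children.get(parent_id)
        if children is None:
            return None
        task = children.pop(request_id, None)
        if not children:
            del self._children[parent_id]
        return task

    def cancel_child(self, parent_id, request_id, reason):
        # Cancel exactly one child, identified by its parent and request ID.
        task = self.unregister(parent_id, request_id)
        if task is None:
            return False
        task.cancel(reason)
        return True
```

In this sketch, a parent whose request 101 timed out would call `tracker.cancel_child("parent-1", 101, "request timed out")` and leave its other children running.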
Thank you for any guidance you can provide; troubleshooting this issue has been a big undertaking. I just made a similar post with details here, which is what led me to this thread:
Hi - We are using Elasticsearch 8.6.2 running on Azure AKS and noticed some serious issues lately.
Since last week (the week of June 30, 2023), we have been seeing huge search transport queues in our cluster that never drain. The queues grow indefinitely until the thread pools eventually start rejecting requests.
The issue happens sporadically and only seems to occur in prod under high volume. We have a cluster of ~48 data nodes with 106 shards, targeting a shard size of 50GB.
We did not make any major infra chan…