Cancellation of tasks

Thomas_Kuisel · July 3, 2023, 6:08pm

Hey @DavidTurner - Our team is seeing an eerily similar issue to this on 8.6.2, and it's causing big problems for our cluster with rejected threads and search transport queues.

Would updating to 8.8 potentially fix this?

I see you posted about this exact topic here in December 2022:

github.com/elastic/elasticsearch

Bound the wait for cancelled tasks to complete

opened 11:07AM - 21 Jan 22 UTC

DaveCTurner

>enhancement :Distributed/Task Management Team:Distributed

Cancelling a task on a remote node is an active process: we send a cancellation request to the remote node and continue to wait for it to indicate the completion of the task. Typically the task will complete with a `TaskCancelledException` but it might fail in a different way, or even succeed, if the cancellation loses the race to completion.

If the remote node is unable to respond for some reason then today we wait indefinitely, and this means we cannot free the resources held by the listener. If the remote node remains unresponsive for long enough then the build-up of listeners on other nodes can cause a cascading failure.

In contrast, if we specify a timeout on the transport request that triggers the remote task then we complete the listener eagerly at the timeout, although the task is still running remotely. Indeed we don't even attempt to cancel the remote task in this case (https://github.com/elastic/elasticsearch/issues/66992) so it just keeps on running.

I believe we should not wait indefinitely for a cancelled task to complete and should instead unilaterally complete the waiting listener with a `TaskCancelledException` to protect against an unresponsive remote node. Maybe we should do this straight away, similarly to how we handle a timeout, but we could also allow some time for the cancellation to happen gracefully first.

Relates https://github.com/elastic/elasticsearch/issues/82337 which describes a similar problem specifically about stats requests, since these are often the source of actual problems in this area.
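To make the proposal in that issue concrete, here is a minimal Python sketch of "bound the wait, then complete the listener ourselves". It is only an illustration of the idea, not Elasticsearch code: `cancel_with_bounded_wait`, `send_cancel`, `grace_seconds` and the Python `TaskCancelledException` class are all hypothetical names made up for this example.

```python
import threading
from concurrent.futures import Future, InvalidStateError


class TaskCancelledException(Exception):
    """Stand-in for the exception named in the issue; not the Elasticsearch class."""


def cancel_with_bounded_wait(listener, send_cancel, grace_seconds):
    """Request cancellation, then stop waiting for the remote node after grace_seconds.

    listener      -- Future that the remote node would normally complete
    send_cancel   -- hypothetical callback that sends the cancellation request
    grace_seconds -- how long to allow for a graceful remote completion
    """
    send_cancel()

    def give_up():
        # The remote node did not answer in time: complete the listener
        # locally so whatever it holds can be released.
        try:
            listener.set_exception(
                TaskCancelledException("gave up waiting for the remote task")
            )
        except InvalidStateError:
            # The remote node completed the listener first; nothing to do.
            pass

    timer = threading.Timer(grace_seconds, give_up)
    timer.daemon = True
    timer.start()
    # If the remote node does respond, the timer is no longer needed.
    listener.add_done_callback(lambda _future: timer.cancel())
    return timer
```

With `grace_seconds` set to zero this behaves like the "do this straight away" option from the issue; a larger value corresponds to allowing some time for a graceful cancellation first.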

Then the following PR was merged and mentioned in the 8.8 release notes:

github.com/elastic/elasticsearch

Child requests proactively cancel children tasks

elastic:main ← kingherc:enhancement/90353-66992-cancel-child-on-timeout

opened 01:55PM - 28 Dec 22 UTC

kingherc

+435 -95

To make this possible we modify the CancellableTasksTracker to track children tasks by the Request ID as well. That way, we can send an Action to cancel a child based on the parent task and the Request ID. This is especially useful when parents' children requests timeout on the parents' side.

The motivation behind this PR lies behind fixing test failure #90353. In discussing the simple solution of PR https://github.com/elastic/elasticsearch/pull/92520, we decided with @DaveCTurner that the best approach to solving the test failure would be to solve #66992. Unfortunately that issue may require substantial effort. But for the moment, we thought it would be easier to cancel children requests on timeout, since we already have infrastructure for tracking children tasks (through the `CancellableTasksTracker`).

Fixes #90353
Relates #66992
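As a rough picture of what "tracking children by the Request ID" buys you, here is a toy Python sketch. It is not the real `CancellableTasksTracker`; the `ChildTask` and `ChildTaskTracker` classes and their method names are hypothetical and exist only to show the lookup by parent task plus request ID.

```python
class ChildTask:
    """Hypothetical child task that only records whether it was cancelled."""

    def __init__(self, name):
        self.name = name
        self.cancelled = False
        self.reason = None

    def cancel(self, reason):
        self.cancelled = True
        self.reason = reason


class ChildTaskTracker:
    """Toy tracker keyed by (parent task ID, request ID); illustration only."""

    def __init__(self):
        self._children = {}

    def register(self, parent_task_id, request_id, child):
        self._children[(parent_task_id, request_id)] = child

    def unregister(self, parent_task_id, request_id):
        # Called when the child finishes normally.
        self._children.pop((parent_task_id, request_id), None)

    def cancel_child(self, parent_task_id, request_id, reason):
        """Cancel the child started by one specific request, if still tracked."""
        child = self._children.pop((parent_task_id, request_id), None)
        if child is None:
            return False  # already finished or never registered
        child.cancel(reason)
        return True


# Usage: the parent's request 7 timed out, so only that child is cancelled.
tracker = ChildTaskTracker()
slow = ChildTask("shard-0 search")
other = ChildTask("shard-1 search")
tracker.register("parent-1", 7, slow)
tracker.register("parent-1", 8, other)
tracker.cancel_child("parent-1", 7, "request timed out on the parent")
```

Keying on the request ID as well as the parent lets a timeout on one child request cancel just that child, instead of leaving it running on the remote node.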

Thank you for any guidance you can provide; troubleshooting this issue has been a big undertaking. I made a similar post with details just now, which led me here:
