Cancellation of tasks

Hello Elastic community!

We have been encountering some difficulties cancelling long-running tasks in our Elasticsearch cluster and could really use some guidance from the experts here. Our current approach is to cancel any search task that takes over 60 seconds to complete using the Task management API, but unfortunately the tasks are not cancelled properly. This has caused our client nodes' memory and CPU usage to skyrocket to 100%, resulting in unresponsiveness and outages on our platform.

We understand the importance of efficiently managing long-running tasks to ensure the stability and performance of our Elasticsearch cluster. Therefore, we are reaching out to seek advice on how to effectively cancel these tasks. Ideally, we would like to terminate the search tasks immediately upon triggering the cancellation.

If any of you have faced similar challenges or have experience with cancelling tasks in Elasticsearch, we would greatly appreciate your insights and recommendations. Here are a few specific questions we have:

  1. Is there a recommended approach or best practice for cancelling search tasks in Elasticsearch that are taking longer than a specified duration? We need to force this cancellation at both client and data nodes, and waiting the request out is not an option.
  2. Are there any specific configurations or settings we should consider adjusting to improve the cancellation process?
  3. What are the potential reasons or factors that could be preventing the successful cancellation of these tasks?
  4. Are there any alternative methods or techniques we could employ to forcefully terminate these tasks to prevent resource exhaustion and cluster instability?

Any guidance, tips, or suggestions you can provide would be immensely valuable to us. We are eager to resolve this issue and ensure smooth operation of our platform. Thank you in advance for your time and assistance!

Best regards


It should be sufficient to cancel on the client end. How exactly are you determining that this isn't working?

Can you share GET _tasks?detailed and GET _nodes/hot_threads?threads=9999 captured at the time of the problems?

What exactly do you mean by "cluster stability"?

We are using a cronjob that fetches all search tasks, using the endpoint you referred to, and cancels the tasks that have been running for more than 60 seconds, using their task ids. During our last incident on Friday, the cancel request was sent at 12:17pm, and the request waiting for the cancellation returned at 1:28pm as cancelled.
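For concreteness, the job boils down to something like this (a simplified Python sketch, not our exact script; the localhost URL is illustrative, and the response shape is the standard `GET _tasks` format):

```python
import json
import urllib.request

ES = "http://localhost:9200"  # illustrative: wherever the cluster is reachable
MAX_RUNNING_NANOS = 60 * 1_000_000_000  # our 60-second threshold

def overrunning_task_ids(tasks_response, max_nanos=MAX_RUNNING_NANOS):
    """Pick out ids of tasks that have been running longer than max_nanos.

    tasks_response is the parsed body of GET _tasks?actions=*search&detailed,
    shaped like {"nodes": {<node_id>: {"tasks": {<task_id>: {...}}}}}.
    """
    ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("running_time_in_nanos", 0) > max_nanos:
                ids.append(task_id)
    return ids

def cancel_task(task_id):
    # POST _tasks/<task_id>/_cancel asks the task (and its children) to stop.
    req = urllib.request.Request(f"{ES}/_tasks/{task_id}/_cancel", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```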

After cancelling a request after 60 seconds, we still see the CPU and memory of the affected client node (heap configured with -Xmx10g -Xms10g) climb towards 100% for the next 30 minutes.

I'll share the endpoints requested next time it happens. Anything else I could share to figure out why the cancellation doesn't work for us?

> We are using a cronjob that fetches all search tasks, using the endpoint you referred to, and cancels the tasks that have been running for more than 60 seconds, using their task ids.

It'd be better to time out on the client side instead of using the task-cancel API, but either way the behaviour you're describing sounds like a bug to me.

> I'll share the endpoints requested next time it happens. Anything else I could share to figure out why the cancellation doesn't work for us?

I think tasks and hot threads should be enough to get started.

Oh sorry, one other thing: if the problem lasts 30 minutes then could you get GET _tasks?detailed and GET _nodes/hot_threads?threads=9999 every minute or so?
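For example, something along these lines would do (a rough Python sketch; the filenames and localhost URL are just an illustration, and the fetch function is passed in so the logic can be dry-run without a cluster):

```python
import time
import urllib.request
from datetime import datetime, timezone

# The two diagnostic endpoints to capture while the problem is happening.
ENDPOINTS = {
    "tasks": "/_tasks?detailed",
    "hot_threads": "/_nodes/hot_threads?threads=9999",
}

def capture_once(fetch, when=None):
    """Grab each diagnostic endpoint once; returns {filename: body}.

    fetch(path) -> bytes is injected so the capture logic itself can be
    exercised without a live cluster.
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    return {f"{name}-{stamp}.txt": fetch(path) for name, path in ENDPOINTS.items()}

def capture_loop(base_url="http://localhost:9200", minutes=30):
    # Capture every minute for the duration of the incident.
    fetch = lambda path: urllib.request.urlopen(base_url + path).read()
    for _ in range(minutes):
        for filename, body in capture_once(fetch).items():
            with open(filename, "wb") as f:
                f.write(body)
        time.sleep(60)
```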

Thank you @DavidTurner. I'll get back to you with the information you're requesting as soon as it happens again.

When you say "time out on the client side", do you mean setting the timeout parameter in the requests, or using the cluster setting search.default_search_timeout? Will that cancel the search task on both client and data nodes, or something else?

Neither, I mean an HTTP-level timeout. If the client doesn't get a response within the time you want, it should just close the network connection. Most HTTP clients have such an option.
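For illustration, here's roughly what I mean as a Python sketch (the helper name and the urllib client are illustrative; any HTTP client with a timeout option behaves the same way):

```python
import socket
import urllib.error
import urllib.request

def search_with_http_timeout(url, query_json, timeout_seconds=60.0):
    """POST a search, giving up at the HTTP level after timeout_seconds.

    On timeout the client closes the network connection, which is what lets
    the server notice the caller has gone away. Returns the response body,
    or None if the request timed out.
    """
    req = urllib.request.Request(
        url,
        data=query_json.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_seconds) as resp:
            return resp.read()
    except (socket.timeout, urllib.error.URLError):
        return None  # connection is closed; no separate cancel call needed
```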

I see, and that will immediately stop all processing on both the client (ingest) and data nodes in the Elasticsearch cluster caused by the request that times out?

It should, in the sense that closing the HTTP connection translates to an automatic call to the task-cancellation API. But TBC if the task cancellation API isn't actually stopping the work promptly then nor will this.

It's still worth doing though, since it means there's no need for a separate cron job, and it also frees up client-side resources earlier (a file descriptor for the network connection and a bit of memory, normally). Then, when we work out how to get cancellation to happen more promptly, it'll all be good.

Hey @DavidTurner - our team is seeing an eerily similar issue to this on 8.6.2, and it's causing big problems for our cluster with rejected threads and search transport queues.

Would updating to 8.8 potentially fix this?

I see you posted about this exact topic here in December 2022:

Then the following PR was merged and mentioned in the 8.8 release notes:

Thank you for any guidance you can provide; troubleshooting this issue has been a big undertaking. I made a similar post here with details just now, which led me here:

Thanks for the ping @Thomas_Kuisel. Your issue sounds related, but maybe a little different. Let's keep the two topics separate to avoid confusion. I've replied on your thread.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.