We have been encountering some difficulties with cancelling longer running tasks in our Elasticsearch cluster and could really use some guidance from the experts here. Our current approach involves cancelling any search tasks that take over 60 seconds to complete using the Task management API, but unfortunately, we are not able to cancel them properly. This issue has caused our client nodes' memory and CPU usage to skyrocket to 100%, resulting in unresponsiveness and outages in our platform.
We understand the importance of efficiently managing long-running tasks to ensure the stability and performance of our Elasticsearch cluster. Therefore, we are reaching out to seek advice on how to effectively cancel these tasks. Ideally, we would like to terminate the search tasks immediately upon triggering the cancellation.
If any of you have faced similar challenges or have experience with cancelling tasks in Elasticsearch, we would greatly appreciate your insights and recommendations. Here are a few specific questions we have:
Is there a recommended approach or best practice for cancelling search tasks in Elasticsearch that are taking longer than a specified duration? We need to force this cancellation at both client and data nodes, and waiting the request out is not an option.
Are there any specific configurations or settings we should consider adjusting to improve the cancellation process?
What are the potential reasons or factors that could be preventing the successful cancellation of these tasks?
Are there any alternative methods or techniques we could employ to forcefully terminate these tasks to prevent resource exhaustion and cluster instability?
Any guidance, tips, or suggestions you can provide would be immensely valuable to us. We are eager to resolve this issue and ensure smooth operation of our platform. Thank you in advance for your time and assistance!
We are using a cronjob that fetches all search tasks, using the endpoint you referred to, and cancels the tasks that have been running for more than 60 seconds, using their task ids. During our last incident on Friday, the cancel request was sent at 12:17pm and the request waiting for cancellation returned at 1:28pm as cancelled.
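A cron job of the kind described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual script: ES_URL, find_long_running and cancel_long_running are hypothetical names, the cluster is assumed to be reachable without authentication, and only top-level tasks are cancelled on the assumption that cancelling a parent search task also cancels its child tasks on the data nodes.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: address of any node in the cluster
MAX_RUNTIME_SECONDS = 60


def find_long_running(tasks_response, max_seconds=MAX_RUNTIME_SECONDS):
    """Return IDs of top-level, cancellable tasks running longer than max_seconds."""
    limit_nanos = max_seconds * 1_000_000_000
    task_ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if "parent_task_id" in task:
                continue  # child task: left to be cancelled via its parent
            if not task.get("cancellable", False):
                continue
            if task.get("running_time_in_nanos", 0) > limit_nanos:
                task_ids.append(task_id)
    return task_ids


def cancel_long_running(base_url=ES_URL):
    """List search tasks and send a cancel request for each long-running one."""
    list_url = f"{base_url}/_tasks?actions=*search*&detailed=true"
    with urllib.request.urlopen(list_url, timeout=10) as resp:
        tasks_response = json.load(resp)
    cancelled = find_long_running(tasks_response)
    for task_id in cancelled:
        request = urllib.request.Request(
            f"{base_url}/_tasks/{task_id}/_cancel", method="POST"
        )
        with urllib.request.urlopen(request, timeout=10) as resp:
            resp.read()  # the body reports which tasks were marked as cancelled
    return cancelled
```

The cron entry point would simply call cancel_long_running(). Note that a cancel request only marks the task as cancelled; how quickly the work actually stops is exactly what is in question in this thread.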
After cancelling a request after 60 seconds, we still see that the CPU and memory of the affected client node (which has the heap configuration -Xmx10g -Xms10g) goes towards 100% for the next 30 minutes.
I'll share the endpoints requested next time it happens. Anything else I could share to figure out why the cancellation doesn't work for us?
We are using a cronjob that fetches all search tasks, using the endpoint you referred to, and cancels the tasks that have been running for more than 60 seconds, using their task ids.
It'd be better to time out on the client side instead of using the task-cancel API, but either way the behaviour you're describing sounds like a bug to me.
I'll share the endpoints requested next time it happens. Anything else I could share to figure out why the cancellation doesn't work for us?
I think tasks and hot threads should be enough to get started.
Oh sorry, one other thing: if the problem lasts 30 minutes then could you get GET _tasks?detailed and GET _nodes/hot_threads?threads=9999 every minute or so?
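One way to collect those two outputs every minute is a small script along these lines. It is only a sketch: ES_URL, http_get, capture_once and capture_loop are hypothetical names, the cluster is assumed to be reachable without authentication, and the file naming is arbitrary.

```python
import time
import urllib.request
from datetime import datetime, timezone

ES_URL = "http://localhost:9200"  # assumption: address of any node in the cluster
ENDPOINTS = {
    "tasks": "/_tasks?detailed",
    "hot_threads": "/_nodes/hot_threads?threads=9999",
}


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def capture_once(fetch=http_get, base_url=ES_URL, now=None):
    """Fetch each endpoint once; return {timestamped file name: response body}."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    return {
        f"{name}-{stamp}.txt": fetch(base_url + endpoint)
        for name, endpoint in ENDPOINTS.items()
    }


def capture_loop(minutes=30, interval_seconds=60):
    """Write one snapshot of each endpoint per interval into the current directory."""
    for _ in range(minutes):
        try:
            for file_name, body in capture_once().items():
                with open(file_name, "wb") as out:
                    out.write(body)
        except OSError as exc:  # covers URLError and socket timeouts
            print(f"capture failed: {exc}")
        time.sleep(interval_seconds)
```

Starting capture_loop() as soon as the problem begins gives a series of timestamped files that can be attached to the thread.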
Thank you @DavidTurner. I'll get back to you with the information you are requesting as soon as it happens again.
When you say "time out on the client side", do you mean setting the timeout parameter in the requests or using the cluster setting search.default_search_timeout? Will that cancel the search task on both client and data nodes, or something else?
Neither, I mean an HTTP-level timeout. If the client doesn't get a response within the time you want, it should just close the network connection. Most HTTP clients have such an option.
I see, and that will immediately stop all processing on both elastic client (ingest) and data nodes in the Elastic cluster that are caused by the request that times out?
It should, in the sense that closing the HTTP connection translates to an automatic call to the task-cancellation API. But to be clear, if the task cancellation API isn't actually stopping the work promptly then nor will this.
It's still worth doing tho, since it means there's no need for a separate cron job and also it frees up client-side resources earlier (a file descriptor for the network connection and a bit of memory normally). Then when we work out how to get cancellation to happen more promptly it'll all be good.
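As an illustration of an HTTP-level timeout, here is a minimal sketch using only the Python standard library. The function name search_with_deadline and its parameters are hypothetical, and other HTTP clients expose the same idea under different option names. One caveat: urllib's timeout applies to each blocking socket operation rather than to the request as a whole, so this assumes the time is spent waiting for the first byte of the response, which is the usual case for a slow search.

```python
import json
import socket
import urllib.error
import urllib.request


def search_with_deadline(url, query, timeout_seconds=60):
    """POST a search request; return the parsed response, or None on timeout.

    When the timeout fires, urllib closes the underlying connection, which is
    the "just close the network connection" behaviour described above.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as resp:
            return json.load(resp)
    except socket.timeout:
        # Timed out while waiting for (or reading) the response.
        return None
    except urllib.error.URLError as exc:
        # Timeouts while connecting or sending arrive wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise  # any other failure (HTTP error, refused connection) is not a timeout
```

A caller would use something like search_with_deadline("http://localhost:9200/my-index/_search", {"query": {"match_all": {}}}), where the host and index name are placeholders, and treat a None result as a timed-out search.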
Hey @DavidTurner - Our team is seeing an eerily similar issue to this on 8.6.2 and it's causing big problems for our cluster with rejected threads and search transport queues.
Would updating to 8.8 potentially fix this?
I see you posted about this exact topic here in December 2022:
Then the following PR was merged and mentioned in the 8.8 release notes:
Thank you for any guidance you can provide, troubleshooting this issue has been a big undertaking. I just made a similar post with details, which led me here:
Thanks for the ping @Thomas_Kuisel. Your issue sounds related, but maybe a little different. Let's keep the two topics separate to avoid confusion. I've replied on your thread.