We have been encountering some difficulties with cancelling long-running tasks in our Elasticsearch cluster and could really use some guidance from the experts here. Our current approach is to cancel any search task that takes over 60 seconds to complete using the Task management API, but unfortunately the tasks are not being cancelled properly. This has caused our client nodes' memory and CPU usage to skyrocket to 100%, resulting in unresponsiveness and outages on our platform.
We understand the importance of efficiently managing long-running tasks to ensure the stability and performance of our Elasticsearch cluster. Therefore, we are reaching out to seek advice on how to effectively cancel these tasks. Ideally, we would like to terminate the search tasks immediately upon triggering the cancellation.
If any of you have faced similar challenges or have experience with cancelling tasks in Elasticsearch, we would greatly appreciate your insights and recommendations. Here are a few specific questions we have:
Is there a recommended approach or best practice for cancelling search tasks in Elasticsearch that are taking longer than a specified duration? We need to force this cancellation at both client and data nodes, and waiting the request out is not an option.
Are there any specific configurations or settings we should consider adjusting to improve the cancellation process?
What are the potential reasons or factors that could be preventing the successful cancellation of these tasks?
Are there any alternative methods or techniques we could employ to forcefully terminate these tasks to prevent resource exhaustion and cluster instability?
Any guidance, tips, or suggestions you can provide would be immensely valuable to us. We are eager to resolve this issue and ensure smooth operation of our platform. Thank you in advance for your time and assistance!
We are using a cronjob that fetches all search tasks, using the endpoint you referred to, and cancels those that have been running for more than 60 seconds, using their task IDs. During our last incident on Friday, the cancel request for the task was sent at 12:17pm, and the request waiting for the cancellation only returned at 1:28pm with the task marked as cancelled.
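Roughly, the cronjob logic looks like the sketch below. It assumes the standard `GET /_tasks?actions=*search*` listing and `POST /_tasks/<task_id>/_cancel` endpoints, a cluster reachable at `localhost:9200`, and the stdlib `urllib` client; adjust to your setup.

```python
import json
import urllib.request  # stdlib; any HTTP client works

ES = "http://localhost:9200"  # assumption: adjust to your cluster address
THRESHOLD_NS = 60 * 1_000_000_000  # 60 seconds, in nanoseconds

def overdue_task_ids(tasks_response, threshold_ns=THRESHOLD_NS):
    """Pick out tasks running longer than the threshold from the body of a
    GET /_tasks?actions=*search* response."""
    overdue = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            # task_id is already in the "node_id:task_number" form
            # that the cancel endpoint expects
            if task.get("running_time_in_nanos", 0) > threshold_ns:
                overdue.append(task_id)
    return overdue

def request_cancellation(task_id):
    """POST /_tasks/<task_id>/_cancel. Note this only *requests*
    cancellation; the task must reach a cancellation checkpoint
    before it actually stops doing work."""
    req = urllib.request.Request(f"{ES}/_tasks/{task_id}/_cancel",
                                 method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

As the comment notes, the cancel endpoint returns once cancellation is requested (unless you pass `wait_for_completion`), which may be part of why the task appears to linger afterwards.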
Even after cancelling a task at the 60-second mark, we still see the CPU and memory usage of the affected client node (which has the heap configuration -Xmx10g -Xms10g) climb towards 100% for the next 30 minutes.
I'll share the endpoints requested next time it happens. Anything else I could share to figure out why the cancellation doesn't work for us?
Thank you @DavidTurner. I'll get back to you with the information you are requesting as soon as it happens again.
When you say "time out on the client side", do you mean setting the timeout parameter on the requests, using the cluster setting search.default_search_timeout, or something else? Will that cancel the search task on both client and data nodes?
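For reference, the per-request option we have in mind is the `timeout` field in the search body, built like this (a sketch; the helper name is ours). Worth noting that this timeout is best-effort: shards check it between collection phases and may return partial results rather than aborting instantly.

```python
import json

def search_body_with_timeout(query, timeout="60s"):
    """Build a search request body with a server-side timeout,
    e.g. for POST /<index>/_search."""
    return {"query": query, "timeout": timeout}

print(json.dumps(search_body_with_timeout({"match_all": {}})))
# → {"query": {"match_all": {}}, "timeout": "60s"}
```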
It should, in the sense that closing the HTTP connection translates into an automatic call to the task-cancellation API. But to be clear: if the task-cancellation API isn't actually stopping the work promptly, then neither will this.
It's still worth doing, though, since it means there's no need for a separate cron job, and it also frees up client-side resources earlier (a file descriptor for the network connection and a bit of memory, normally). Then, once we work out how to get cancellation to happen more promptly, it'll all be good.
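A client-side timeout of the kind described above might look like the following sketch, assuming the stdlib `urllib` client (the URL and body are placeholders). When the timeout fires, the connection is closed, which is what triggers the automatic cancellation call on the server side.

```python
import json
import socket
import urllib.request

def bounded_search(url, body, timeout_s=60):
    """POST a search and give up after timeout_s seconds.
    Returns the parsed response, or None if the read timed out;
    in the latter case the connection is closed, which prompts
    Elasticsearch to cancel the server-side search task."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return json.load(resp)
    except (TimeoutError, socket.timeout):
        return None  # timed out waiting for the response; connection closed
```

Most real clients (including the official Elasticsearch clients) expose an equivalent request-timeout setting, so this logic usually reduces to a single configuration option rather than hand-rolled HTTP.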