Elasticsearch Task API does not cancel tasks

Hi

I'm currently having a problem with my cluster. While trying to debug certain documents that were indexed, I made a query with a script in it and queued various requests of that search.

The problem is that now, when I try to cancel those tasks, the API reports them as cancelled, but they still appear in the list of running tasks and have been running for several hours now.
I tried to cancel the parent task along with its children, but they keep running.
Kibana can no longer connect to Elasticsearch, and the only way to do anything is through curl.
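For reference, the cancel calls I tried were along these lines (just a sketch; the parent task ID is the one that shows up in the task listing below, and the standard _tasks/_cancel endpoint is used both for the parent and, via parent_task_id, for its children):

# Cancel the parent search task by its task ID
curl -X POST "localhost:9200/_tasks/a98sSQVJRtefhq4egRBVkg:3308613042/_cancel?pretty"

# Cancel every child task that belongs to that parent
curl -X POST "localhost:9200/_tasks/_cancel?parent_task_id=a98sSQVJRtefhq4egRBVkg:3308613042&pretty"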

The Elastic Stack version is 6.2.4.
Running the following request:

curl -X GET "localhost:9200/_cat/tasks?v"

I get:

indices:data/read/search a98sSQVJRtefhq4egRBVkg:3308613042 - transport 1542808265542 10:51:05 2.8h 172.27.202.150 es1
indices:data/read/search[phase/query] a98sSQVJRtefhq4egRBVkg:3308613043 a98sSQVJRtefhq4egRBVkg:3308613042 direct 1542808265542 10:51:05 2.8h x.x.x.x es1
indices:data/read/search[phase/query] a98sSQVJRtefhq4egRBVkg:3308613044 a98sSQVJRtefhq4egRBVkg:3308613042 direct 1542808265542 10:51:05 2.8h x.x.x.x es1
indices:data/read/search[phase/query] a98sSQVJRtefhq4egRBVkg:3308613045 a98sSQVJRtefhq4egRBVkg:3308613042 direct 1542808265543 10:51:05 2.8h x.x.x.x es1
indices:data/read/search[phase/query] a98sSQVJRtefhq4egRBVkg:3308613046 a98sSQVJRtefhq4egRBVkg:3308613042 direct 1542808265543 10:51:05 2.8h x.x.x.x es1
indices:data/read/search[phase/query] LnF0A5gATiK3Fjy-cAUSLQ:2222282441 a98sSQVJRtefhq4egRBVkg:3308613042 netty 1542808265596 10:51:05 2.8h x.x.x.x es3
indices:data/read/search[phase/query] LnF0A5gATiK3Fjy-cAUSLQ:2222282439 a98sSQVJRtefhq4egRBVkg:3308613042 netty 1542808265596 10:51:05 2.8h x.x.x.x es3
indices:data/read/search[phase/query] LnF0A5gATiK3Fjy-cAUSLQ:2222282442 a98sSQVJRtefhq4egRBVkg:3308613042 netty 1542808265597 10:51:05 2.8h x.x.x.x es3
indices:data/read/search[phase/query] xIZT5fVqQuugw2qYbBRrXA:3595912068 a98sSQVJRtefhq4egRBVkg:3308613042 netty 1542808265607 10:51:05 2.8h x.x.x.x es2
indices:data/read/search[phase/query] xIZT5fVqQuugw2qYbBRrXA:3595912066 a98sSQVJRtefhq4egRBVkg:3308613042 netty 1542808265607 10:51:05 2.8h x.x.x.x es2
indices:data/read/search[phase/query] xIZT5fVqQuugw2qYbBRrXA:3595912071 a98sSQVJRtefhq4egRBVkg:3308613042 netty 1542808265608 10:51:05 2.8h x.x.x.x es2

and running this request:

curl -X GET "localhost:9200/_tasks?actions=*search&detailed&pretty'

I get several entries like the following:

"a98sSQVJRtefhq4egRBVkg:3309107911" : {
"node" : "a98sSQVJRtefhq4egRBVkg",
"id" : 3309107911,
"type" : "transport",
"action" : "indices:data/read/search",
"description" : "indices[.kibana], types, search_type[QUERY_THEN_FETCH], source[{"from":0,"size":10000,"query":{"bool":{"filter":[{"term":{"type":{"value":"index-pattern","boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"version":true,"_source":{"includes":["index-pattern.title","type","title"],"excludes":}}]",
"start_time_in_millis" : 1542809005112,
"running_time_in_nanos" : 9505448128013,
"cancellable" : true,
"headers" : { }
},
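For anyone hitting the same thing, the task list can also be narrowed down to just the children of the stuck parent (a sketch, reusing the parent task ID from the first listing):

curl -X GET "localhost:9200/_tasks?parent_task_id=a98sSQVJRtefhq4egRBVkg:3308613042&detailed&pretty"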

I need to terminate these tasks to make the cluster usable again. Any direction on how to solve this problem would be appreciated.

Thanks in advance.

Rod

Anyone?

If you cancel a task and it doesn't go away "soon", that is a bug. What kind of bug, or how we'd reproduce it, I don't know. These things happen because cancellation is "cooperative", due to the constraints of the JVM: tasks have to notice that they have been cancelled and then shut down. There is no way to forcibly cancel a task.

I'd file an issue with, if possible, the results of running the hot_threads API and, if possible, a jstack thread dump. That would tell us whether the search query is spending a long time in code that isn't paying attention to the cancellation. We expect it to spend some time in that code, but not a long time.
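Something along these lines should capture both (a sketch; es1 is the node name from your _cat/tasks output, and <es-pid> is a placeholder for the Elasticsearch process id on that host):

# Hot threads for the node that owns the stuck parent task
curl -X GET "localhost:9200/_nodes/es1/hot_threads?threads=9999" > hot_threads.txt

# Full JVM thread dump, taken on the node itself
jstack <es-pid> > jstack.txt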

I'm intentionally vague about the times here because, well, I don't remember all of this code, and also because the times depend on the size of the workload and things like that.

Thanks for the reply, Nik.

We had to restart the cluster to restore it, but the problem hasn't occurred again.
It's a production cluster; if it happens again (I hope not), I will gather the pertinent data and file the issue you suggested.

Thanks again for the reply; I will keep this thread posted if it happens again.

Regards
