After some time, I noticed some strange behavior on one of my clusters (others have an identical configuration and the same type of data/actions, but do not have the following problem):
[2019-01-24T05:32:01,537][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [c02] Received ban for the parent [zXSiTjd1Q1G9AAPo6WEmJg:1741023573] on the node [zXSiTjd1Q1G9AAPo6WEmJg], reason: [by user request]
[2019-01-24T05:32:01,538][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [c02] Sending remove ban for tasks with the parent [zXSiTjd1Q1G9AAPo6WEmJg:1741023573] to the node [w30c6uTFTAOYWdoCwtiIyw]
[2019-01-24T05:32:01,538][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [c02] Sending remove ban for tasks with the parent [zXSiTjd1Q1G9AAPo6WEmJg:1741023568] to the node [saVc3cP7QA6JTbb_vjZZ9g]
[2019-01-24T05:32:01,538][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [c02] Sending remove ban for tasks with the parent [zXSiTjd1Q1G9AAPo6WEmJg:1741023573] to the node [saVc3cP7QA6JTbb_vjZZ9g]
[2019-01-24T05:32:01,538][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [c02] Sending remove ban for tasks with the parent [zXSiTjd1Q1G9AAPo6WEmJg:1741023573] to the node [zXSiTjd1Q1G9AAPo6WEmJg]
[2019-01-24T05:32:01,538][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [c02] Removing ban for the parent [zXSiTjd1Q1G9AAPo6WEmJg:1741023573] on the node [zXSiTjd1Q1G9AAPo6WEmJg]
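When these ban entries appear, the in-flight search tasks can be listed with the Task Management API; the parent task id below is just the one taken from the log above:
GET _tasks?detailed=true&actions=*search*&group_by=parents
GET _tasks/zXSiTjd1Q1G9AAPo6WEmJg:1741023573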
I came across this while trying to understand why some daily Scroll API queries were failing. Example of a request:
GET xxx.2019-01/_search?scroll=5m
{
  "sort": [
    "_doc"
  ],
  "size": 10000,
  "_source": [
    "field_1",
    "field_n"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "field_1": "a"
                }
              },
              {
                "match_phrase": {
                  "field_1": "b"
                }
              }
            ],
            "minimum_should_match": 1
          }
        },
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "field_n": {
                    "query": "XXX"
                  }
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}
Response:
{
  "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoAgAAAAAIJdMoFnpYU2lUamQxUTFHOUFBUG82V0VtSmcAAAAABUsP_BZoYmFteWN3cFN1MlRfdVZjaFBubnBR",
  "took": 348,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 1,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 0,
        "index": "xxx.2019-01",
        "node": "hbamycwpSu2T_uVchPnnpQ",
        "reason": {
          "type": "task_cancelled_exception",
          "reason": "cancelled"
        }
      }
    ]
  },
  "hits": {
    "total": 8311,
    "max_score": null,
    "hits": []
  }
}
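For completeness, the remaining pages of such a query are fetched with the standard scroll continuation request (sketch only; the scroll_id is just the one returned above):
GET _search/scroll
{
  "scroll": "5m",
  "scroll_id": "DnF1ZXJ5VGhlbkZldGNoAgAAAAAIJdMoFnpYU2lUamQxUTFHOUFBUG82V0VtSmcAAAAABUsP_BZoYmFteWN3cFN1MlRfdVZjaFBubnBR"
}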
So, for some reason(s) unknown to me so far, Elasticsearch starts cancelling such requests from time to time. Additionally, here are the cluster settings I use on top of the defaults:
thread_pool.search.queue_size: 5000
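In case the queue size is related, the search thread pool can be checked for queueing and rejections with the cat API:
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed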
I would greatly appreciate it if someone could help me understand the reasons for this behavior (why the "ban" happens) and possible solutions (besides retrying the request).
Thanks.