Large dataset leads to TaskCancelledException: cancelled

Hi,

When watching a dashboard or browsing Discover over a large dataset (several hundred GB), Elasticsearch will sometimes just give up with the error message below.

The cluster consists of 10 nodes with a total of 88 cores and 750 GB of RAM, so I'm a bit surprised it can't cope with this, especially since neither the CPU nor the RAM load seems particularly high.

The settings are all pretty much vanilla. Are there any I could tweak to make better use of the host resources?
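For completeness, here's how I verified we're on defaults and checked resource usage (standard cluster APIs, pasted here in Dev Tools syntax):

```
GET _cluster/settings?include_defaults=true&flat_settings=true
GET _nodes/stats/os,jvm?human
```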

The cluster runs on 7.8.1 (docker).

Any advice is appreciated!

    es01      | {"type": "server", "timestamp": "2020-08-04T13:46:29,723Z", "level": "DEBUG", "component": "o.e.a.s.TransportSearchAction", "cluster.name": "jarvis-cluster", "node.name": "es01", "message": "[packetbeat-7.2.1-2020.08.04][3], node[Corx6_gGS2u_2OF2ZDgIsQ], [P], s[STARTED], a[id=6KqQeokKSJ-eT36mOX_gxA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[packetbeat-*], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='1596545856183', requestCache=true, scroll=null, maxConcurrentShardRequests=100, batchedReduceSize=5, preFilterShardSize=1, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=false, source={\"size\":0,\"query\":{\"bool\":{\"filter\":[{\"match_all\":{\"boost\":1.0}},{\"match_all\":{\"boost\":1.0}}],\"must_not\":[{\"match_phrase\":{\"event.dataset\":{\"query\":\"flow\",\"slop\":0,\"zero_terms_query\":\"NONE\",\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"version\":true,\"_source\":{\"includes\":[],\"excludes\":[]},\"stored_fields\":\"*\",\"docvalue_fields\":[{\"field\":\"@timestamp\",\"format\":\"date_time\"},{\"field\":\"event.created\",\"format\":\"date_time\"},{\"field\":\"event.end\",\"format\":\"date_time\"},{\"field\":\"event.ingested\",\"format\":\"date_time\"},{\"field\":\"event.start\",\"format\":\"date_time\"},{\"field\":\"file.accessed\",\"format\":\"date_time\"},{\"field\":\"file.created\",\"format\":\"date_time\"},{\"field\":\"file.ctime\",\"format\":\"date_time\"},{\"field\":\"file.mtime\",\"format\":\"date_time\"},{\"field\":\"package.installed\",\"format\":\"date_time\"},{\"field\":\"process.parent.start\",\"format\":\"date_time\"},{\"field\":\"process.start\",\"format\":\"date_time\"},{\"field\":\"tls.client.not_after\",\"format\":\"date_time\"},{\"field\":\"tls.client.not_before\",\"format\":\"date_time\"},{\"field\":\"tls.client_certificate.not_after\",\"format\":\"date_time\"},{\"field\":\"tls.client_certificate.not_before\",\"format\":\"date_time\"},{\"field\":\"tls.detailed.client_certificate.not_after\",\"format\":\"date_time\"},{\"field\":\"tls.detailed.client_certificate.not_before\",\"format\":\"date_time\"},{\"field\":\"tls.detailed.server_certificate.not_after\",\"format\":\"date_time\"},{\"field\":\"tls.detailed.server_certificate.not_before\",\"format\":\"date_time\"},{\"field\":\"tls.server.not_after\",\"format\":\"date_time\"},{\"field\":\"tls.server.not_before\",\"format\":\"date_time\"},{\"field\":\"tls.server_certificate.not_after\",\"format\":\"date_time\"},{\"field\":\"tls.server_certificate.not_before\",\"format\":\"date_time\"}],\"script_fields\":{},\"track_total_hits\":2147483647,\"aggregations\":{\"maxAgg\":{\"max\":{\"field\":\"event.duration\"}},\"minAgg\":{\"min\":{\"field\":\"event.duration\"}}}}}]", "cluster.uuid": "Nqv_6mbTS5uXHpAU7Cq_EA", "node.id": "C87PAa1QSUK7N-aGjCpvqg" , 
    es01      | "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es10][123.22.0.9:9300][indices:data/read/search[phase/query]]",
    es01      | "Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: Query Failed [Failed to execute main query]",
    es01      | "at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:323) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:151) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$22(IndicesService.java:1384) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$23(IndicesService.java:1436) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:176) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:159) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:125) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1442) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1381) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:359) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:434) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.access$200(SearchService.java:135) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:411) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]",
    es01      | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]",
    es01      | "at java.lang.Thread.run(Thread.java:832) [?:?]",
    es01      | "Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled",
    es01      | "at org.elasticsearch.search.query.QueryPhase.lambda$executeInternal$3(QueryPhase.java:288) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.internal.ContextIndexSearcher$MutableQueryTimeout.checkCancelled(ContextIndexSearcher.java:356) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:59) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]",
    es01      | "at org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:212) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:185) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]",
    es01      | "at org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:344) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:299) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:151) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$22(IndicesService.java:1384) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$23(IndicesService.java:1436) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:176) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:159) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:125) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1442) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1381) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:359) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:434) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.access$200(SearchService.java:135) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:411) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.1.jar:7.8.1]",
    es01      | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]",
    es01      | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]",
    es01      | "at java.lang.Thread.run(Thread.java:832) ~[?:?]"] }
    es01      | {"type": "server", "timestamp": "2020-08-04T13:46:29,727Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "jarvis-cluster", "node.name": "es01", "message": "Removing ban for the parent [C87PAa1QSUK7N-aGjCpvqg:223845284] on the node [C87PAa1QSUK7N-aGjCpvqg]", "cluster.uuid": "Nqv_6mbTS5uXHpAU7Cq_EA", "node.id": "C87PAa1QSUK7N-aGjCpvqg"  }

Does nobody have a clue? Google shows me a whopping 4 results when searching for
"org.elasticsearch.tasks.TaskCancelledException: cancelled"

So, I'm really in need of some friendly advice :slightly_smiling_face:

What kind of storage are you using for the cluster? Local SSDs? How much data do you have in the cluster? How many indices and shards?
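If you're not sure of the numbers, the cat APIs give a quick overview (Dev Tools syntax; the sort parameter should work on any 7.x cluster):

```
GET _cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc
GET _cat/allocation?v
```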

Hi Christian,

Yes, it's indeed local SSDs. We currently have between 1 and 4 TB of data.
Right now we've got ~50 indices, though that can grow significantly (>1000), each with 5 shards. The largest indices hold ~400GB of data.

Now that I'm typing these numbers I remember that shards shouldn't be too big. Do you think this error is caused by the 400 / 5 = 80GB shards?
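To verify, I listed the per-shard sizes, largest first (cat shards API; adjust the index pattern as needed):

```
GET _cat/shards/packetbeat-*?v&h=index,shard,prirep,store,node&s=store:desc
```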

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.