I am having problems with an application (Arkime) that uses Elasticsearch to store its data. Long searches time out, and when I look at the Arkime logs I find errors coming back from Elasticsearch. The same errors appear in the Elasticsearch logs, and I have included one below along with the full stack trace:
[2021-03-05T12:03:06,541][WARN ][r.suppressed ] [secesprd02] path: /sessions2-210304%2Csessions2-210303/session/_search, params: {rest_total_hits_as_int=true, ignore_unavailable=true, preference=primaries, index=sessions2-210304,sessions2-210303, type=session}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:568) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:324) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:230) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:112) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:51) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.CountedCollector.onFailure(CountedCollector.java:70) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase$2.onFailure(FetchSearchPhase.java:194) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:408) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:670) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:712) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:704) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.SearchTransportService.sendExecuteFetch(SearchTransportService.java:174) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.SearchTransportService.sendExecuteFetch(SearchTransportService.java:164) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase.executeFetch(FetchSearchPhase.java:176) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:156) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:47) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:95) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0.jar:7.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled
at org.elasticsearch.search.query.QueryPhase.lambda$executeInternal$3(QueryPhase.java:285) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.internal.ContextIndexSearcher$MutableQueryTimeout.checkCancelled(ContextIndexSearcher.java:370) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:54) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
at org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:226) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:199) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
at org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:341) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:296) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:148) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:431) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.SearchService.access$500(SearchService.java:141) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:401) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.0.jar:7.10.0]
... 7 more
Any ideas on how to diagnose what is causing the shards to fail? I can't see any clues in the logs. The index in question has two primary shards on different servers and no replicas, with 185 segments in total. It contains just over 400 GB of data and half a billion documents (this is flow data from the border of a large university network). The indices are managed by ILM.
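For reference, the shard, segment, and size figures above can be pulled from the _cat APIs; a minimal sketch of how to fetch them (assuming the cluster answers on localhost:9200 without authentication, so adjust the host and credentials to your setup):

```python
# Minimal sketch: pull shard, segment, and size stats for the two
# indices named in the failing search via the Elasticsearch _cat APIs.
# Assumes the cluster is reachable on localhost:9200 without auth.
import requests

BASE = "http://localhost:9200"
INDICES = "sessions2-210304,sessions2-210303"

for endpoint in (
    f"_cat/indices/{INDICES}?v&h=index,pri,rep,docs.count,store.size",
    f"_cat/shards/{INDICES}?v",
    f"_cat/segments/{INDICES}?v&h=index,shard,segment,docs.count,size",
):
    resp = requests.get(f"{BASE}/{endpoint}")
    resp.raise_for_status()
    print(resp.text)
```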