All shards failing error

I am having problems with an application (Arkime) that uses Elasticsearch (ES) to store its data. Long searches time out, and the application logs show errors returned by ES. The same errors appear in the ES logs; I have included one here with the full traceback:

[2021-03-05T12:03:06,541][WARN ][r.suppressed             ] [secesprd02] path: /sessions2-210304%2Csessions2-210303/session/_search, params: {rest_total_hits_as_int=true, ignore_unavailable=true, preference=primaries, index=sessions2-210304,sessions2-210303, type=session}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:568) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:324) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:230) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:112) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:51) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.CountedCollector.onFailure(CountedCollector.java:70) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.FetchSearchPhase$2.onFailure(FetchSearchPhase.java:194) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:408) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:670) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:712) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:704) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.SearchTransportService.sendExecuteFetch(SearchTransportService.java:174) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.SearchTransportService.sendExecuteFetch(SearchTransportService.java:164) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.FetchSearchPhase.executeFetch(FetchSearchPhase.java:176) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:156) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:47) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:95) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0.jar:7.10.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled
        at org.elasticsearch.search.query.QueryPhase.lambda$executeInternal$3(QueryPhase.java:285) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.internal.ContextIndexSearcher$MutableQueryTimeout.checkCancelled(ContextIndexSearcher.java:370) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:54) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
        at org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:226) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:199) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
        at org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:341) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:296) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:148) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:431) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.SearchService.access$500(SearchService.java:141) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:401) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.0.jar:7.10.0]
        ... 7 more

Any ideas how to diagnose what is causing the shards to fail? I can't see any clues in the logs. Each index has two primary shards on different servers and no replicas. There are 185 segments. The index contains just over 400GB of data and half a billion documents (this is flow data at the border of a large university). The indexes are managed by ILM.

The search was cancelled, which is typically because the client (or a proxy) closed the connection before Elasticsearch sent its response.
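
In the trace above, the outer SearchPhaseExecutionException ("all shards failed") wraps a TaskCancelledException, and it is that innermost "Caused by:" line that says what actually happened. A minimal sketch for pulling that line out of a log excerpt follows; the function name root_cause is hypothetical, and the code assumes the standard Java "Caused by:" layout seen in the traceback:

```python
import re

# Matches lines such as
# "Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled"
CAUSED_BY = re.compile(r"^Caused by: (?P<cls>[\w.$]+)(?:: (?P<msg>.*))?$")


def root_cause(trace: str):
    """Return (exception_class, message) from the last 'Caused by:' line, or None."""
    cause = None
    for line in trace.splitlines():
        match = CAUSED_BY.match(line.strip())
        if match:
            # Keep overwriting so that the innermost (last) cause wins.
            cause = (match.group("cls"), match.group("msg") or "")
    return cause


if __name__ == "__main__":
    sample = """\
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:568)
Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled
        at org.elasticsearch.search.query.QueryPhase.lambda$executeInternal$3(QueryPhase.java:285)
"""
    print(root_cause(sample))
```

Run against the excerpt in the sample string, this reports the TaskCancelledException rather than the generic "all shards failed" wrapper.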

Hmm, I did see that, but I am pretty sure the client did not explicitly cancel. Something else (a proxy) may have timed out the TCP session. Would that fit the message? This started happening recently, and I suspect something between the client and the server has changed.

It only happens on searches that take a long time.

Yes, a connection closed by a proxy or other intermediary would have the same effect. It doesn't matter to Elasticsearch who caused the connection to close - in fact the two cases are identical from Elasticsearch's point of view.
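
One way to test the intermediary theory is to note how long each failing search ran before the connection dropped. Failures that all end after roughly the same duration point to a fixed idle timeout on a proxy or load balancer, whereas widely scattered durations suggest something else. A minimal sketch, assuming the durations have already been collected from client logs; the function name, the three-sample minimum, and the 5% tolerance are arbitrary choices for illustration, not anything Elasticsearch or Arkime provides:

```python
from statistics import median


def looks_like_fixed_timeout(failed_durations_s, tolerance=0.05):
    """Return the suspected timeout in seconds if failures cluster, else None.

    failed_durations_s: seconds each failed search ran before the drop.
    tolerance: allowed relative spread around the median (assumed value).
    """
    if len(failed_durations_s) < 3:
        return None  # too few samples to say anything
    mid = median(failed_durations_s)
    if mid <= 0:
        return None
    # Every failure must sit within the tolerance band around the median.
    if all(abs(d - mid) <= tolerance * mid for d in failed_durations_s):
        return mid
    return None


if __name__ == "__main__":
    # Hypothetical measurements, not taken from the logs in this thread.
    print(looks_like_fixed_timeout([60.1, 59.9, 60.3, 60.0]))
    print(looks_like_fixed_timeout([12.0, 95.0, 40.0]))
```

If the function returns a value, comparing it with the timeout settings of whatever sits between the client and Elasticsearch is a reasonable next step.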
