All shards failed across multiple indexes

We are running the ELK stack 7.7.1 in a simple, single-node setup with a couple of dozen indices. I'm finding a very large number of "all shards failed" messages in the logs, and have seen connection failures lasting a minute or more during some of these times. I am working on the assumption that a connection failure accompanies every one of these errors.

I have tried to find information that would help resolve this, but haven't been successful. Is anyone able to point me in the right direction?

The main part of the error message is:

org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed

Many thanks

Can you provide more of the error logs than just that single line? The whole stack trace might contain more information to help figure out the problem here.

Also, enabling stack traces so that they are returned as part of the response might help, if you have identified a query that triggers those errors. See Common options | Elasticsearch Guide [8.1] | Elastic
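For reference, the stack trace can be requested per search with the `error_trace` parameter from the common options page. A sketch (the index name and query are placeholders, substitute your own):

```
GET /<index_name>/_search?error_trace=true
{
  "query": {
    "match_all": {}
  }
}
```

With `error_trace=true`, any shard failure in the response includes the full server-side stack trace instead of just the exception summary.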

Thank you for the response.

The complete stack trace is below. Your comment above implies that it may be caused by a search being run, but I've seen these errors occurring on indices that were not in use at the time (e.g. training and development systems).

Enabling and capturing the stack traces as part of the query response would be problematic and not something that could be done quickly, but I will keep that change in mind.

[2022-03-31T08:00:30,381][WARN ][r.suppressed             ] [JIRA] path: /<index_name>/_search, params: {typed_keys=true, scroll=2m, index=<index_name>, filter_path=hits.hits._id}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:580) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:68) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:245) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:402) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1248) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1222) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:56) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:413) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.1.jar:7.7.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:150) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:463) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:443) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.access$200(SearchService.java:135) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:411) ~[elasticsearch-7.7.1.jar:7.7.1]
	... 6 more

I still don't see enough of the stack trace. Any chance of a more verbose one, or is that all you have?
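Separately, the `Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled` in your trace means the search task itself is being cancelled, which often happens when the client disconnects mid-search or a scroll expires. It might be worth looking at what search tasks are in flight when this happens, e.g. via the task management API (a generic sketch, not specific to your setup):

```
GET /_tasks?actions=*search*&detailed=true
```

If tasks show up here and then vanish around the time of the "all shards failed" warnings, that would support the client-disconnect theory.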

Hi,

Thanks for the response. I actually managed to capture this at TRACE level yesterday. I'm curious about the cluster:admin/tasks/cancel[n] and internal:admin/tasks/ban entries.

Again, any assistance or pointers would be appreciated.
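For reference, I enabled the transport tracer with dynamic cluster settings along these lines (the logger name matches the `o.e.t.T.tracer` entries below, so treat this as an approximation of what I ran):

```
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE"
  }
}
```

Setting the logger back to `null` afterwards restores the default level.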

[2022-04-07T15:10:56,187][TRACE][o.e.t.T.tracer           ] [JIRA] [7660063][indices:data/read/search[phase/query]] received request
[2022-04-07T15:10:56,187][TRACE][o.e.t.T.tracer           ] [JIRA] [7660063][indices:data/read/search[phase/query]] sent response
[2022-04-07T15:10:56,187][TRACE][o.e.t.T.tracer           ] [JIRA] [7660063][indices:data/read/search[phase/query]] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]
[2022-04-07T15:10:56,231][TRACE][o.e.t.T.tracer           ] [JIRA] [7660064][cluster:admin/tasks/cancel[n]] sent to [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] (timeout: [null])
[2022-04-07T15:10:56,231][TRACE][o.e.t.T.tracer           ] [JIRA] [7660064][cluster:admin/tasks/cancel[n]] received request
[2022-04-07T15:10:56,231][TRACE][o.e.t.T.tracer           ] [JIRA] [7660065][internal:admin/tasks/ban] sent to [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] (timeout: [null])
[2022-04-07T15:10:56,231][TRACE][o.e.t.T.tracer           ] [JIRA] [7660065][internal:admin/tasks/ban] received request
[2022-04-07T15:10:56,231][TRACE][o.e.t.T.tracer           ] [JIRA] [7660065][internal:admin/tasks/ban] sent response
[2022-04-07T15:10:56,231][TRACE][o.e.t.T.tracer           ] [JIRA] [7660065][internal:admin/tasks/ban] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]
[2022-04-07T15:10:56,231][TRACE][o.e.t.T.tracer           ] [JIRA] [7660064][cluster:admin/tasks/cancel[n]] sent response
[2022-04-07T15:10:56,232][TRACE][o.e.t.T.tracer           ] [JIRA] [7660064][cluster:admin/tasks/cancel[n]] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]
[2022-04-07T15:10:56,236][TRACE][o.e.t.T.tracer           ] [JIRA] [7660066][indices:data/read/search[phase/query]] sent to [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] (timeout: [null])
[2022-04-07T15:10:56,236][TRACE][o.e.t.T.tracer           ] [JIRA] [7660066][indices:data/read/search[phase/query]] received request
[2022-04-07T15:10:56,237][TRACE][o.e.t.T.tracer           ] [JIRA] [7660066][indices:data/read/search[phase/query]] sent response
[2022-04-07T15:10:56,237][TRACE][o.e.t.T.tracer           ] [JIRA] [7660066][indices:data/read/search[phase/query]] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]
[2022-04-07T15:10:56,298][TRACE][o.e.t.T.tracer           ] [JIRA] [7659189][indices:data/read/search[phase/query]] sent error response
org.elasticsearch.tasks.TaskCancelledException: cancelled
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:150) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:463) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:443) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.access$200(SearchService.java:135) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:411) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.1.jar:7.7.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
[2022-04-07T15:10:56,298][TRACE][o.e.t.T.tracer           ] [JIRA] [7659189][indices:data/read/search[phase/query]] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]
[2022-04-07T15:10:56,299][TRACE][o.e.t.T.tracer           ] [JIRA] [7660067][internal:admin/tasks/ban] sent to [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] (timeout: [null])
[2022-04-07T15:10:56,299][TRACE][o.e.t.T.tracer           ] [JIRA] [7660067][internal:admin/tasks/ban] received request
[2022-04-07T15:10:56,299][TRACE][o.e.t.T.tracer           ] [JIRA] [7660067][internal:admin/tasks/ban] sent response
[2022-04-07T15:10:56,299][TRACE][o.e.t.T.tracer           ] [JIRA] [7660067][internal:admin/tasks/ban] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]
[2022-04-07T15:10:56,299][WARN ][r.suppressed             ] [JIRA] path: /<index name>/_search, params: {typed_keys=true, scroll=2m, index=<index name>, filter_path=hits.hits._id}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:580) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:68) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:245) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:402) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1248) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1222) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:56) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:413) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.1.jar:7.7.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.tasks.TaskCancelledException: cancelled
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:150) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:463) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:443) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.access$200(SearchService.java:135) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:411) ~[elasticsearch-7.7.1.jar:7.7.1]
	... 6 more
[2022-04-07T15:10:56,312][TRACE][o.e.t.T.tracer           ] [JIRA] [7660068][indices:data/read/search[phase/query]] sent to [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] (timeout: [null])
[2022-04-07T15:10:56,312][TRACE][o.e.t.T.tracer           ] [JIRA] [7660068][indices:data/read/search[phase/query]] received request
[2022-04-07T15:10:56,312][TRACE][o.e.t.T.tracer           ] [JIRA] [7660068][indices:data/read/search[phase/query]] sent response
[2022-04-07T15:10:56,312][TRACE][o.e.t.T.tracer           ] [JIRA] [7660068][indices:data/read/search[phase/query]] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]
[2022-04-07T15:10:56,407][TRACE][o.e.t.T.tracer           ] [JIRA] [7660069][indices:data/read/search[phase/query]] sent to [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] (timeout: [null])
[2022-04-07T15:10:56,407][TRACE][o.e.t.T.tracer           ] [JIRA] [7660069][indices:data/read/search[phase/query]] received request
[2022-04-07T15:10:56,408][TRACE][o.e.t.T.tracer           ] [JIRA] [7660069][indices:data/read/search[phase/query]] sent response
[2022-04-07T15:10:56,408][TRACE][o.e.t.T.tracer           ] [JIRA] [7660069][indices:data/read/search[phase/query]] received response from [{JIRA}{b3eY_XZESy6UOziLW6VTkQ}{hj66fAItSS-zauV242yHig}{JIRA}{192.168.51.52:9300}{dilmrt}{ml.machine_memory=34359267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]

Cheers

This is still occurring regularly. Can anyone offer advice?