All shards failed for phase: [query] on Elasticsearch 7.2.0

Hey guys,
I'm running Elasticsearch 7.2.0 on a single machine in Docker containers. It works fine except for some problems with the shards.
I let Elasticsearch run for a few days and looked at the logs today. They look fine, except for one thing: Elasticsearch threw a lot of "All shards failed" exceptions at some points. Please find the stack trace below:

{"type": "server", "timestamp": "2019-07-30T15:03:57,209+0200", "level": "DEBUG", "component": "o.e.a.s.TransportSearchAction", "cluster.name": "A-Elastic-Stack", "node.name": "es01", "cluster.uuid": "qxxxxxxxxxxxxxx", "node.id": "0xxxxxxxxxxx",  "message": "All shards failed for phase: [query]"  }
{"type": "server", "timestamp": "2019-07-30T15:03:57,213+0200", "level": "WARN", "component": "r.suppressed", "cluster.name": "A-Elastic-Stack", "node.name": "es01", "cluster.uuid": "qxxxxxxxxxxxx", "node.id": "0xxxxxxxxxxxxx",  "message": "path: /.kibana_task_manager/_search, params: {ignore_unavailable=true, index=.kibana_task_manager}" ,
"stacktrace": ["org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:139) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:105) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$1(InitialSearchPhase.java:251) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:172) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]"] }

The results of GET /_cluster/health/?level=shards look fine to me. Please find a snippet of the result below:

  "cluster_name" : "A-Elastic-Stack",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 37,
  "active_shards" : 37,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0,
  "indices" : {
    ".monitoring-logstash-7-2019.07.28" : {
      "status" : "green",
      "number_of_shards" : 1,
      "number_of_replicas" : 0,
      "active_primary_shards" : 1,
      "active_shards" : 1,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "shards" : {
        "0" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        }
      }
    },
...
    ".kibana_task_manager" : {
      "status" : "green",
      "number_of_shards" : 1,
      "number_of_replicas" : 0,
      "active_primary_shards" : 1,
      "active_shards" : 1,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "shards" : {
        "0" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        }
      }
    },

All of the shards are green.
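For a long response, checking by eye that every shard is green gets tedious. A minimal Python sketch of the same check, assuming the body of GET /_cluster/health/?level=shards has already been parsed into a dict (the function name `non_green_shards` is made up for illustration):

```python
def non_green_shards(health):
    """Return (index, shard id, status) for every shard that is not green.

    `health` is assumed to be the parsed JSON body of
    GET /_cluster/health/?level=shards, shaped like the snippet above.
    """
    problems = []
    for index_name, index in health.get("indices", {}).items():
        for shard_id, shard in index.get("shards", {}).items():
            if shard.get("status") != "green":
                problems.append((index_name, shard_id, shard.get("status")))
    return problems
```

An empty list means every shard in the response reported green at the moment the request was answered. It says nothing about the state of the shards at the time the exceptions were logged.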
I also noticed that Elasticsearch occasionally throws similar errors during startup. However, during startup it varies between "all shards failed", ".security-7 failed", and no exceptions whatsoever.

Can somebody give me a hint what to look at?
Could it be related to the setting auto_expand_replicas, which is set to "0-1"? Is ES trying to create a replica but failing to do so since there are no other instances of ES up and running?
Thanks in advance!

Were there any other log messages around the same times? I might expect to see this while a node is starting up, or maybe shutting down, but not otherwise in a one-node cluster.

No, if Elasticsearch has only ever seen a single node then it won't have been trying to create a replica.
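To make the arithmetic concrete, here is a rough Python model of how a range such as "0-1" maps to a replica count. It is a simplified sketch of the documented behaviour, not Elasticsearch's actual code: the function names are invented, the value "false" (which disables the setting) is not handled, and the edge case where the lower bound exceeds the number of other data nodes is not modelled faithfully.

```python
def parse_auto_expand(setting):
    """Split a value like "0-1" or "0-all" into (min, max); max is None for "all"."""
    low, high = setting.split("-", 1)
    max_replicas = None if high == "all" else int(high)
    return int(low), max_replicas


def expected_replicas(setting, data_nodes):
    """Simplified model: one replica per additional data node, clamped to the range."""
    min_replicas, max_replicas = parse_auto_expand(setting)
    wanted = data_nodes - 1  # a replica never shares a node with its primary
    if max_replicas is not None:
        wanted = min(wanted, max_replicas)
    return max(wanted, min_replicas)


print(expected_replicas("0-1", 1))  # 0
print(expected_replicas("0-1", 2))  # 1
```

With a single data node the result is 0, which matches the "number_of_replicas" : 0 shown in the health output above.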

Hi David,
thanks for your answer!
Turns out I misread the timestamp. You are correct, the cluster was rebooted at the time.
Do you know why the shards fail while starting up?

The log message indicates that a search failed on every shard it tried. The shards themselves are still starting up at this point, which can take some time.
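A client that wants to avoid searching too early can ask the cluster health API to block until the index reaches at least yellow. Below is a minimal Python sketch using only the standard library. The helper names and the `fetch` parameter are made up for illustration, the base URL is an assumption, and authentication is left out even though a cluster with security enabled would need it. The 408 handling reflects that 7.x answers a timed-out health wait with that status code.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlencode


def health_url(base_url, index, status="yellow", timeout="30s"):
    """Build a cluster health URL that waits for `status` on one index."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health/{index}?{query}"


def fetch_json(url):
    """GET a URL and parse the JSON body; a 408 (health wait timed out) still has a body."""
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 408:
            return json.load(err)
        raise


def index_is_searchable(base_url, index, fetch=fetch_json):
    """True once the index reports yellow or green before the wait times out."""
    body = fetch(health_url(base_url, index))
    return not body.get("timed_out", True) and body.get("status") in ("yellow", "green")
```

For example, `index_is_searchable("http://localhost:9200", ".kibana_task_manager")` would return False while the primary shard is still initializing and True once it has started. The `fetch` argument exists only so the logic can be exercised without a running cluster.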

Arguably the logs are being overly dramatic here; there's no need for such a noisy warning in this case. I opened an issue to discuss this further.

Ok great! Thank you for the clarification.
