This happened to me on a test bed running the 6.7 ELK stack on Kubernetes (3 ingest nodes, 3 master nodes, 3 data nodes with 32 TB each), with indices configured for 5 shards and 1 replica.
Kibana returned a JSON error: "message":"all shards failed: [search_phase_execution_exception] all shards failed"
** API CALLS AND RESULTS:
_cat/indices: indeed showed all indices were red.
_cluster/allocation/explain: listed one index shard as "unassigned" with reason "NODE_LEFT", "no_valid_shard_copy"
_cat/shards: showed every index with at least one shard whose primary and replica were both "UNASSIGNED"
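For anyone wanting to repeat these checks from a script, here is a minimal sketch of how the _cat output could be filtered. It assumes the cluster is reachable at the hypothetical address in ES_URL and that the _cat endpoints are called with format=json; the helper names (fetch, red_indices, unassigned_shards) are my own for illustration and not part of any Elasticsearch client.

```python
import json
import urllib.request

# Assumption: address of any node in the cluster; adjust for your setup.
ES_URL = "http://localhost:9200"


def fetch(path):
    """GET a path from the cluster and return the response body as text."""
    with urllib.request.urlopen(ES_URL + path) as resp:
        return resp.read().decode("utf-8")


def red_indices(cat_indices_json):
    """Return the names of indices whose health is 'red'.

    Expects the body of GET /_cat/indices?format=json.
    """
    rows = json.loads(cat_indices_json)
    return [row["index"] for row in rows if row.get("health") == "red"]


def unassigned_shards(cat_shards_json):
    """Return (index, shard number, 'p' or 'r') for every UNASSIGNED shard copy.

    Expects the body of GET /_cat/shards?format=json.
    """
    rows = json.loads(cat_shards_json)
    return [
        (row["index"], int(row["shard"]), row["prirep"])
        for row in rows
        if row.get("state") == "UNASSIGNED"
    ]
```

Against a live cluster this would be used as red_indices(fetch("/_cat/indices?format=json")) and unassigned_shards(fetch("/_cat/shards?format=json")). A shard number that appears with both "p" and "r" in the second result is one where primary and replica are both unassigned.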
** POD INVESTIGATION:
By the time I discovered that the data node had gone down, it was already back up. Kubernetes showed the data node back in the cluster, and the logs on that node showed no indication of anything bad. The other two data nodes had logged a 'cannot reach data node 2' exception at the same time. Opening a terminal in the datanode 2 container, I looked through the data folder: /usr/share/elasticsearch/data/nodes/0/indices had a large number of indices on disk.
** THE "FIX"
I tried closing and reopening the indices, hoping that ES would find the index data that appeared to be on datanode 2. This did nothing: the shards were still missing and the indices were still red.
A lot of reading online didn't lead me to anything obvious that would "rediscover" those lost shards on datanode 2, so I turned my attention to just getting the cluster back to green. After much reading I found the reroute API and used its "allocate_empty_primary" command on each and every red index (which was nearly all of them).
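For reference, the reroute request I mean looks roughly like the sketch below. The function names and the es_url parameter are placeholders of my own; the body follows the documented shape of the cluster reroute API, where allocate_empty_primary takes an index, a shard number, a target node, and an explicit accept_data_loss flag. Note that this command creates an empty primary, so whatever was in that shard is gone.

```python
import json
import urllib.request


def empty_primary_commands(shards, node):
    """Build a _cluster/reroute body with one allocate_empty_primary per shard.

    shards: iterable of (index name, shard number) pairs.
    node:   name of the node that should receive the new, empty primaries.
    """
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # Required by the API: the old shard contents are discarded.
                    "accept_data_loss": True,
                }
            }
            for index, shard in shards
        ]
    }


def post_reroute(es_url, body):
    """POST the body to /_cluster/reroute and return the parsed JSON response."""
    request = urllib.request.Request(
        es_url + "/_cluster/reroute",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Building the body separately from sending it makes it possible to print and review the list of commands before anything destructive is posted to the cluster.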
** THE REMAINING QUESTIONS
So a few questions came out of this for me:
- Why, when I only lost one data node, did my indices go red, and why did I lose entire shard sets (primary and replica)? I can understand losing either, but not both, since ES never allocates both to the same node.
- Why, when the node came back online, did ES not eventually rediscover the unassigned shards?
- Rerouting the shards got me back to green with data loss, but how could I have forced ES to search the rejoined data node for the missing primary or replica shards?
Any help/insight would be greatly appreciated. Thank you in advance!