I have an Elasticsearch cluster with 3 master nodes, 2 data nodes, and a coordinating node, each deployed on a separate Virtual Machine on Azure. It was working fine, but it suddenly failed, and now a `[search_phase_execution_exception] all shards failed` error comes back whenever we try to search any data.
There are 200+ indices in total, all with red status.
Following is the health of the entire cluster:
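(The health output itself was not captured above; it can be pulled with the standard health and cat APIs, requests only:)

```
GET /_cluster/health
GET /_cat/indices?v&health=red
```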
And the cluster architecture of 6 nodes with only 2 data nodes is pretty odd: did you lose most of your data nodes somehow?
There must be some history here: any node VM or process restarts? Something had to happen. Does this sit on VMware, shared SAN storage, or something else that could have failed? You can run an allocation explain to get the first unassigned shard and the reason why:
```
GET /_cluster/allocation/explain
```
That might help. Is there any allocation awareness configured, or lost nodes or node attributes that prevent allocation?
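If allocation filtering or awareness settings are a suspect, the non-default cluster settings can be inspected with:

```
GET /_cluster/settings?flat_settings=true
```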
"primary": true,
"current_state": "unassigned",
"unassigned_info": {
"reason": "CLUSTER_RECOVERED",
"at": "2020-07-19T16:11:27.591Z",
"last_allocation_status": "no_valid_shard_copy"
},
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",```
I am getting this response. Is there any way to assign this shard again?
So what is the history here, and why only two data nodes? This error implies the cluster had your indices but lost the shard copies, likely because the surviving copies are stale, i.e. there was a primary on another node that was lost before the replicas were brought up to date. You might check a few more shards/indices (the explain API takes parameters for this; see the docs), but the story is probably the same.
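The explain API can also target a specific shard to check other indices; the index name below is a placeholder:

```
GET /_cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}
```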
I don't recall offhand whether you can promote a stale shard; I think there is an API for it, though it can fail, and it is better to find the bad nodes or understand what happened here first.
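For the record, the API in question is the cluster reroute command `allocate_stale_primary`. It forces a stale copy to become primary and requires explicitly accepting data loss, since that copy may be missing recent writes (the index and node names here are placeholders):

```
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "my-index",
        "shard": 0,
        "node": "data-node-1",
        "accept_data_loss": true
      }
    }
  ]
}
```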
And if you have snapshots, best to recover from them.
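A restore from an existing snapshot repository would look roughly like this (the repository, snapshot, and index names are placeholders; restoring over a red index requires closing or deleting it first):

```
POST /_snapshot/my_repository/my_snapshot/_restore
{
  "indices": "my-index",
  "include_global_state": false
}
```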