Elasticsearch "all shards failed" on a multi-node cluster on Azure VMs

I have an Elasticsearch cluster with 3 master nodes, 2 data nodes, and one more cluster node, each deployed on a separate virtual machine on Azure. It was working fine, but it suddenly failed, and now every search returns a [search_phase_execution_exception] "all shards failed" error.
There are 200+ indexes in total, all with red status.
The health of the entire cluster is as follows:

"status": "red",
"timed_out": false,
"number_of_nodes": 6,
"number_of_data_nodes": 2,
"active_primary_shards": 5,
"active_shards": 10,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 524,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 1.8726591760299627

What could be the possible solution for this?
Help would be appreciated.

Which version are you using? Is there anything in the Elasticsearch logs that provides any clues?

And what is the cluster architecture? Six nodes with only 2 data nodes is pretty odd - did you lose most of your data nodes somehow?

There must be some history here, such as a node, VM, or process restart, because something had to happen. Does this sit on VMware or some shared SAN storage, or something else that could have failed? You can run an allocation explain to get the first unassigned shard and the reason it is unassigned:

GET /_cluster/allocation/explain

That might help. Is there any allocation awareness configured, and could lost nodes or node attributes be preventing allocation?
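A minimal Python sketch of that call, for illustration only. It assumes the cluster is reachable at http://localhost:9200 without authentication (an assumption, not something stated in this thread), and the helper names are hypothetical. With no request body, the API explains the first unassigned shard it finds; with a body, it explains the shard you name.

```python
import json
import urllib.request


def build_explain_request(base_url, index=None, shard=0, primary=True):
    """Build a request for GET /_cluster/allocation/explain.

    If index is None, no body is sent and Elasticsearch picks the first
    unassigned shard. Otherwise the body names one specific shard.
    """
    url = base_url.rstrip("/") + "/_cluster/allocation/explain"
    if index is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def summarize_explanation(explanation):
    """Keep only the fields that say why the shard is unassigned."""
    info = explanation.get("unassigned_info", {})
    return {
        "reason": info.get("reason"),
        "last_allocation_status": info.get("last_allocation_status"),
        "can_allocate": explanation.get("can_allocate"),
        "allocate_explanation": explanation.get("allocate_explanation"),
    }


def fetch_explanation(request):
    """Send the request and return the summarized JSON response."""
    with urllib.request.urlopen(request) as response:
        return summarize_explanation(json.load(response))
```

Calling fetch_explanation(build_explain_request("http://localhost:9200")) would return the summary for the first unassigned shard. Passing index, shard and primary targets one shard instead.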

6.5.3

Thanks Steve, I am trying now with allocation/explain

Hello @Steve_Mushero,

        "primary": true,
        "current_state": "unassigned",
        "unassigned_info": {
            "reason": "CLUSTER_RECOVERED",
            "at": "2020-07-19T16:11:27.591Z",
            "last_allocation_status": "no_valid_shard_copy"
        },
        "can_allocate": "no_valid_shard_copy",
        "allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",

This is the response I am getting. Is there any way to assign this shard again?

So what is the history here, and why only two data nodes? This error implies the cluster had your indexes but lost them, likely because the remaining copies are stale, i.e. a primary on another node was lost before the replicas were brought up to date, or something similar. You might check a few more shards or indexes (allocation explain can be pointed at a specific shard, see the docs), though they may well show the same thing.
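One way to see whether all shards share the same cause is to list them with the cat shards API and count the unassigned ones by reason. The sketch below is illustrative: it assumes the rows come from GET /_cat/shards with format=json and the columns shown in CAT_SHARDS_PATH, and that the JSON keys match the column names requested in h. The function name is hypothetical.

```python
from collections import Counter

# Path to request; the columns are standard cat shards columns.
CAT_SHARDS_PATH = (
    "/_cat/shards?format=json"
    "&h=index,shard,prirep,state,unassigned.reason"
)


def tally_unassigned(rows):
    """Count unassigned shards by (primary-or-replica, reason).

    rows is the parsed JSON list returned by the cat shards API.
    prirep is "p" for a primary and "r" for a replica.
    """
    counts = Counter()
    for row in rows:
        if row.get("state") == "UNASSIGNED":
            key = (row.get("prirep"), row.get("unassigned.reason"))
            counts[key] += 1
    return counts
```

If nearly every unassigned primary reports the same reason, explaining a handful of shards individually is probably enough to confirm the pattern.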

I don't recall whether you can promote a stale shard. I think there is an API for it, but it can fail; it is better to find the bad nodes or understand what happened here.

And if you have snapshots, it is best to restore from them.
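The API in question is most likely the cluster reroute API (POST /_cluster/reroute) with the allocate_stale_primary command, which forces a stale copy to become the primary. It requires accept_data_loss to be set to true, because any writes the stale copy never received are lost, and it only works if the named node still holds a copy of the shard on disk. A sketch of the request body in Python follows; the index and node names in the usage note are placeholders, not values from this cluster.

```python
import json


def stale_primary_command(index, shard, node):
    """Build the body for POST /_cluster/reroute.

    Promotes the stale copy of the given shard on the given node to
    primary. accept_data_loss must be True or the command is rejected.
    """
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


def to_json(body):
    """Serialize the body for sending with Content-Type: application/json."""
    return json.dumps(body, indent=2)
```

For example, to_json(stale_primary_command("my-index", 0, "data-node-1")) produces the body for one shard. Treat this as a last resort after checking the logs and any available snapshots.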
