Elasticsearch "all shards failed" on cluster with multiple nodes on Azure VMs

pranav_patwardhan · July 19, 2020, 5:31pm

I have an Elasticsearch cluster with 3 master nodes, 2 data nodes, and a cluster node, each deployed on a separate virtual machine on Azure. It was working fine, but it suddenly failed, and now we get a [search_phase_execution_exception] "all shards failed" error whenever we try to search any data.
There are 200+ indices in total, all with red status.
The health of the entire cluster is as follows:

"status": "red",
"timed_out": false,
"number_of_nodes": 6,
"number_of_data_nodes": 2,
"active_primary_shards": 5,
"active_shards": 10,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 524,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 1.8726591760299627

What could be the possible solution for this?
Help would be appreciated.

Christian_Dahlqvist · July 19, 2020, 7:27pm

Which version are you using? Is there anything in the Elasticsearch logs that provides any clues?

Steve_Mushero · July 20, 2020, 7:57am

And what is the cluster architecture? 6 nodes with only 2 data nodes is pretty odd. Did you lose most of your data nodes somehow?

There must be some history here: was there any node, VM, or process restart? Something had to happen. Does this sit on VMware, shared SAN storage, or something else that could have failed? You can run an allocation explain to get the first unassigned shard and the reason it is unassigned:

GET /_cluster/allocation/explain

That might help. Is there any allocation awareness configured, or are there lost nodes or node attributes that prevent allocation?
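
For anyone scripting this check, here is a minimal Python sketch of the same call. Without a body, the allocation explain API reports on the first unassigned shard it finds; with a body naming an index, shard number, and primary/replica flag, it explains that specific copy. The base URL and the index name in the usage note are placeholders, not values from this thread, and the sketch assumes a cluster without authentication.

```python
import json
import urllib.request


def build_explain_request(base_url, index=None, shard=None, primary=True):
    """Build (but do not send) a request to the cluster allocation explain API."""
    url = base_url.rstrip("/") + "/_cluster/allocation/explain"
    if index is None:
        # No body: Elasticsearch picks the first unassigned shard it finds.
        return urllib.request.Request(url, method="GET")
    # With a body, the explanation targets one specific shard copy.
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def explain(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Something like `explain(build_explain_request("http://localhost:9200", index="my-index", shard=0))` would then return the explanation for the primary of shard 0, assuming the cluster is reachable at that address.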

pranav_patwardhan · July 21, 2020, 11:33am

6.5.3

pranav_patwardhan · July 21, 2020, 2:02pm

Thanks Steve, I am trying now with allocation/explain

pranav_patwardhan · July 21, 2020, 2:53pm

Hello @Steve_Mushero,

        "primary": true,
        "current_state": "unassigned",
        "unassigned_info": {
            "reason": "CLUSTER_RECOVERED",
            "at": "2020-07-19T16:11:27.591Z",
            "last_allocation_status": "no_valid_shard_copy"
        },
        "can_allocate": "no_valid_shard_copy",
        "allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",

I am getting this response. Is there any way to assign this shard again?

Steve_Mushero · July 22, 2020, 2:39am

So what is the history here, and why are there only two data nodes? This error implies the cluster had your indices but lost them, likely because the remaining copies are stale, i.e. there was a primary on another node that was lost before the replicas were updated, or something similar. You might check a few more shards/indices (the explain API has an option for this, see the docs), but the result may be the same.
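
To see whether all unassigned shards share the same cause without explaining them one by one, the rows from `GET /_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason` can be tallied. The sketch below only does the counting; the sample rows are made up for illustration and are not output from the cluster in this thread.

```python
from collections import Counter


def unassigned_by_reason(cat_shards_rows):
    """Count unassigned shards per (reason, primary-or-replica) pair."""
    counts = Counter()
    for row in cat_shards_rows:
        if row.get("state") != "UNASSIGNED":
            continue  # skip STARTED, INITIALIZING and RELOCATING shards
        kind = "primary" if row.get("prirep") == "p" else "replica"
        counts[(row.get("unassigned.reason"), kind)] += 1
    return counts


# Hypothetical rows shaped like the JSON output of _cat/shards.
sample_rows = [
    {"index": "logs-a", "shard": "0", "prirep": "p",
     "state": "UNASSIGNED", "unassigned.reason": "CLUSTER_RECOVERED"},
    {"index": "logs-a", "shard": "0", "prirep": "r",
     "state": "UNASSIGNED", "unassigned.reason": "CLUSTER_RECOVERED"},
    {"index": "logs-b", "shard": "0", "prirep": "p",
     "state": "STARTED", "unassigned.reason": None},
]

for (reason, kind), count in sorted(unassigned_by_reason(sample_rows).items()):
    print(reason, kind, count)
```

If every unassigned primary reports the same reason, explaining one or two of them is probably representative of the rest.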

I don't recall exactly whether you can promote a stale shard; I think there is an API for it, but it can fail. It is better to find the bad nodes or understand what happened here.

And if you have snapshots, it is best to recover from them.
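
The API in question is most likely the `allocate_stale_primary` command of the cluster reroute API (`POST /_cluster/reroute`). It forces a stale copy to become the primary and requires `accept_data_loss` to be set explicitly, because any writes the stale copy never received are gone for good. The sketch below only builds the request body; the index and node names are placeholders, and the node must be one that actually holds a copy of the shard.

```python
import json


def stale_primary_command(index, shard, node):
    """Build the body for POST /_cluster/reroute that promotes a stale copy."""
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,  # node that still holds a copy of the shard
                    # Required flag: acknowledges that missing writes are lost.
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Placeholder names, not taken from this thread.
print(json.dumps(stale_primary_command("my-index", 0, "data-node-1"), indent=2))
```

This is a last resort for shards that cannot be restored any other way, so it is worth trying on a single shard first.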
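
As a rough sketch of the snapshot route, a restore is a `POST` to `/_snapshot/<repository>/<snapshot>/_restore` with a body listing the indices to bring back. The repository, snapshot, and index names below are placeholders. Note that an existing index can only be restored over if it is closed, so red indices generally have to be closed or deleted first.

```python
from urllib.parse import quote


def build_restore_request(repository, snapshot, indices):
    """Return the (path, body) pair for a snapshot restore call."""
    path = "/_snapshot/{}/{}/_restore".format(
        quote(repository, safe=""), quote(snapshot, safe="")
    )
    body = {
        # Comma-separated list of the indices to restore.
        "indices": ",".join(indices),
        # Leave cluster-wide settings and templates untouched.
        "include_global_state": False,
    }
    return path, body


# Placeholder names, not taken from this thread.
path, body = build_restore_request("my_backup", "snapshot_1", ["index-a", "index-b"])
print(path)
print(body)
```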

system · August 19, 2020, 2:40am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
