Elasticsearch cluster is in Red state. How to recover it?

chaitra_hegde · November 12, 2019, 5:53am

Hi,
I am using Elasticsearch 6.6.1 in a k8s environment. My cluster was previously green, but it is now red because of UNASSIGNED shards. Many shards report PRIMARY_FAILED as the unassigned reason.

You can find the details below.
Response from primary shard:
curl -X GET "http://xx.xx.xx.xx:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "log-2019-10-08",
  "shard": 9,
  "primary": true
}
'
{
  "index" : "log-2019-10-08",
  "shard" : 9,
  "primary" : true,
  "current_state" : "initializing",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2019-10-22T09:26:26.488Z",
    "details" : "node_left[UVNJBTB8SuC3OiVnaB4Tfw]",
    "last_allocation_status" : "awaiting_info"
  },
  "current_node" : {
    "id" : "UVNJBTB8SuC3OiVnaB4Tfw",
    "name" : "elasticsearch-data-4",
    "transport_address" : "xx.xx.xx.xx:9300"
  },
  "explanation" : "the shard is in the process of initializing on node [elasticsearch-data-4], wait until initialization has completed"
}

The replica response is below.
curl -X GET "http://xx.xx.xx.xx:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "log-2019-10-08",
  "shard": 9,
  "primary": false
}
'
{
  "index" : "log-2019-10-08",
  "shard" : 9,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "PRIMARY_FAILED",
    "at" : "2019-10-21T20:06:43.919Z",
    "details" : "primary failed while replica initializing",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "Ljab6IuXQTOUNxk8RkcuGg",
      "node_name" : "elasticsearch-data-1",
      "transport_address" : "xx.xx.xx.xx:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "UVNJBTB8SuC3OiVnaB4Tfw",
      "node_name" : "elasticsearch-data-4",
      "transport_address" : "xx.xx.xx.xx:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[log-2019-10-08][9], node[UVNJBTB8SuC3OiVnaB4Tfw], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=clqmzyGgSQC4HBKolutV-Q], unassigned_info[[reason=NODE_LEFT], at[2019-10-22T09:26:26.488Z], delayed=false, details[node_left[UVNJBTB8SuC3OiVnaB4Tfw]], allocation_status[fetching_shard_data]]]"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "V4qtWbtLRqyDqW9f6T0mog",
      "node_name" : "elasticsearch-data-2",
      "transport_address" : "xx.xx.xx.xx:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "r6UCUEPzR6aY0Kz8NiauDg",
      "node_name" : "elasticsearch-data-0",
      "transport_address" : "xx.xx.xx.xx:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "sNmtl-VvQMqS2bcXEycB-g",
      "node_name" : "elasticsearch-data-3",
      "transport_address" : "xx.xx.xx.xx:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    }
  ]
}
How can I bring my cluster back to a healthy state?
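
The replica output above repeats the same blocker on every node, which is easy to miss in a long response. Below is a minimal Python sketch, assuming the allocation explain response has already been parsed into a dict, that tallies which deciders said NO and on how many nodes. The function name `summarize_deciders` and the trimmed two-node sample are hypothetical and only mirror the shape of the response shown above.

```python
from collections import Counter


def summarize_deciders(explain: dict) -> Counter:
    """Count, per decider name, how many nodes returned a NO decision."""
    counts = Counter()
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                counts[decider["decider"]] += 1
    return counts


# Trimmed sample (two nodes only) with the same structure as the response above.
sample = {
    "index": "log-2019-10-08",
    "shard": 9,
    "primary": False,
    "node_allocation_decisions": [
        {
            "node_name": "elasticsearch-data-1",
            "deciders": [
                {"decider": "replica_after_primary_active", "decision": "NO"},
                {"decider": "throttling", "decision": "NO"},
            ],
        },
        {
            "node_name": "elasticsearch-data-4",
            "deciders": [
                {"decider": "replica_after_primary_active", "decision": "NO"},
                {"decider": "same_shard", "decision": "NO"},
                {"decider": "throttling", "decision": "NO"},
            ],
        },
    ],
}

for name, nodes in summarize_deciders(sample).most_common():
    print(f"{name}: NO on {nodes} node(s)")
```

If `replica_after_primary_active` is the blocker on every node, as it is here, the replica cannot be placed until its primary becomes active, so the primary is the shard to investigate first.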

Christian_Dahlqvist · November 12, 2019, 6:26am

What is the full output of the cluster health API?

chaitra_hegde · November 12, 2019, 7:16am

Response from cluster health API is below:
curl -XGET xx.xx.xx.xx:9200/_cluster/health?pretty
{
  "cluster_name" : "my-cluster-1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 494,
  "active_shards" : 716,
  "relocating_shards" : 0,
  "initializing_shards" : 449,
  "unassigned_shards" : 401,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 20,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 6691643,
  "active_shards_percent_as_number" : 45.721583652618136
}
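
As a quick sanity check on the numbers above, the shard counters can be added up by hand. The Python sketch below assumes that relocating shards are already included in `active_shards`, so the total is active plus initializing plus unassigned; the counters in this response are consistent with that assumption. The helper names are hypothetical.

```python
def total_shards(health: dict) -> int:
    """Sum the shard counters reported by the cluster health API."""
    return (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )


def active_shard_percent(health: dict) -> float:
    """Recompute the active shard percentage from the raw counters."""
    total = total_shards(health)
    return 100.0 * health["active_shards"] / total if total else 100.0


# Counters copied from the cluster health response above.
health = {
    "active_shards": 716,
    "initializing_shards": 449,
    "unassigned_shards": 401,
}

print(total_shards(health))                    # 1566
print(round(active_shard_percent(health), 2))  # 45.72
```

More than half of the 1566 shards are either initializing or unassigned, and 449 of them are initializing at the same time.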

Christian_Dahlqvist · November 12, 2019, 8:01am

How did you end up in this state? Are all original data nodes now part of the cluster?

chaitra_hegde · November 12, 2019, 9:07am

Hi,
Two of the worker nodes went down within a few hours of each other. When we recovered the worker nodes, the issue started appearing in the Elasticsearch cluster.
All the original data nodes are now part of the cluster.

DavidTurner · November 12, 2019, 11:14am

The primary shard you looked at above is recovering: its current_state is "initializing" on node elasticsearch-data-4.

This means you should just wait and eventually it will recover.

DavidTurner · November 12, 2019, 12:23pm

The number of initializing shards (449) in your cluster health output suggests you have increased a setting such as cluster.routing.allocation.node_concurrent_recoveries far too high. Your cluster may be deadlocked. Could you set it (and any other related settings) back to the default and perform a full cluster restart?
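
One way to check for such overrides is to look at the output of GET _cluster/settings?flat_settings=true and reset anything under cluster.routing.allocation by assigning null through PUT _cluster/settings. The Python sketch below only inspects an already-parsed response and builds the request body; it does not contact a cluster. The helper names and the sample value of 100 are invented for illustration, and settings placed in elasticsearch.yml would not appear in this API's default output.

```python
import json

SETTING = "cluster.routing.allocation.node_concurrent_recoveries"


def find_overrides(settings: dict, prefix: str = "cluster.routing.allocation") -> dict:
    """Return explicitly set allocation settings, keyed as 'scope:setting'."""
    found = {}
    for scope in ("persistent", "transient"):
        for key, value in settings.get(scope, {}).items():
            if key.startswith(prefix):
                found[f"{scope}:{key}"] = value
    return found


def reset_body(overrides: dict) -> str:
    """Build a PUT _cluster/settings body; a null value removes the override."""
    body = {}
    for scoped_key in overrides:
        scope, key = scoped_key.split(":", 1)
        body.setdefault(scope, {})[key] = None
    return json.dumps(body, indent=2)


# Invented sample of a flat_settings response; the value 100 is an assumption.
current = {"persistent": {}, "transient": {SETTING: "100"}}

overrides = find_overrides(current)
print(overrides)
print(reset_body(overrides))
```

The printed body can then be sent with curl in the same way as the other requests in this thread.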

chaitra_hegde · November 18, 2019, 6:45am

But the primary shards have been in the initializing state for 4-5 days.

chaitra_hegde · November 18, 2019, 6:46am

Hi,
I am using a k8s environment. What do you mean by a full cluster restart? How can I restart my ES cluster?

DavidTurner · November 18, 2019, 8:32am

I don't know about Kubernetes specifically, but a full cluster restart is where you shut all of the nodes down and then start them all up again.

chaitra_hegde · November 18, 2019, 9:38am

Hi,
Under what conditions or scenarios should cluster.routing.allocation.node_concurrent_recoveries be tuned to a value other than the default?

DavidTurner · November 18, 2019, 10:26am

I would only use this parameter for experiments in a test environment. I would not recommend adjusting it from the default in a production environment.

alex_polisevschi · November 19, 2019, 6:13pm

Have you tried the reroute API?

POST /_cluster/reroute?retry_failed=true

The cluster will attempt to allocate a shard a maximum of index.allocation.max_retries times in a row (defaults to 5), before giving up and leaving the shard unallocated. This scenario can be caused by structural problems such as having an analyzer which refers to a stopwords file which doesn’t exist on all nodes.

Once the problem has been corrected, allocation can be manually retried by calling the reroute API with the ?retry_failed URI query parameter, which will attempt a single retry round for these shards.
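
For anyone scripting this rather than using curl, here is a minimal Python sketch of the same retry call using only the standard library. The base URL is a placeholder assumption and the helper names are hypothetical; nothing is sent unless `send` is called explicitly.

```python
import json
import urllib.request


def build_retry_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for /_cluster/reroute?retry_failed=true."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder address; replace with the address of a node in your cluster.
request = build_retry_request("http://localhost:9200/")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it easy to review the exact URL before anything touches the cluster.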

chaitra_hegde · November 27, 2019, 6:58am

Hi,
I have set cluster.routing.allocation.node_concurrent_recoveries back to its default value and performed a full cluster restart. I have also called the reroute API with ?retry_failed.
This reduced the number of red shards, and I now have 66 unassigned shards.
Many indices still have some shards in the red state. Since I have recovered some of the shards of those indices, I do not want to delete a whole red index just to bring my cluster back to a healthy state.
So how can I delete a particular red shard of an index without deleting the whole index?
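
Before deciding what to do with individual shards, it may help to list exactly which primaries are still unassigned. The Python sketch below assumes the output of GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason has been parsed into a list of dicts. The function name and the sample rows are invented for illustration, and the sketch only reports shards; it does not change anything in the cluster.

```python
from collections import defaultdict


def unassigned_primaries_by_index(cat_shards: list) -> dict:
    """Group the shard numbers of unassigned primaries by index name."""
    grouped = defaultdict(list)
    for row in cat_shards:
        if row.get("prirep") == "p" and row.get("state") == "UNASSIGNED":
            grouped[row["index"]].append(int(row["shard"]))
    return {index: sorted(shards) for index, shards in grouped.items()}


# Invented sample rows in the shape returned by _cat/shards with format=json.
rows = [
    {"index": "log-2019-10-08", "shard": "9", "prirep": "p",
     "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "log-2019-10-08", "shard": "9", "prirep": "r",
     "state": "UNASSIGNED", "unassigned.reason": "PRIMARY_FAILED"},
    {"index": "log-2019-10-08", "shard": "3", "prirep": "p",
     "state": "STARTED", "unassigned.reason": None},
    {"index": "log-2019-10-09", "shard": "2", "prirep": "p",
     "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
]

for index, shards in unassigned_primaries_by_index(rows).items():
    print(index, shards)
```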

system · December 25, 2019, 6:58am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
