Elasticsearch cluster status is red. Allocate missing primary shards and replica shards

samwzm · December 15, 2018, 3:55am

Hi, there,

We just upgraded ES from 6.3.2 to 6.5.1. Our cluster has 3 master nodes and 2 hot data nodes. It worked fine at first, until it was recently restarted. The only change is the data location (we use the Docker image): since the hot data nodes were almost out of space, we changed their data location to a newly attached volume. After restarting the cluster (which took great effort), the cluster is now in red status:

High severity alert:
Elasticsearch cluster status is red. Allocate missing primary shards and replica shards.

Below is the allocation explain output. Please advise how to fix this!

Thanks

GET /_cluster/allocation/explain

{
"index" : ".watches",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2018-12-14T21:25:31.671Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [2B8CjR23T5q6tKDLkuOdew]: failed recovery, failure RecoveryFailedException[[.watches][0]: Recovery failed from {es-hot-1}{xLsr0j9SREGAGLF6IRBR3A}{3V-zLIMATKaKNHpob95_bA}{54.156.46.144}{54.156.46.144:9300}{xpack.installed=true} into {es-hot-2}{2B8CjR23T5q6tKDLkuOdew}{3H5GvWeKQaugCB1wsq6r3A}{54.156.57.244}{54.156.57.244:9300}{xpack.installed=true}]; nested: RemoteTransportException[[es-hot-1][172.18.0.2:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[es-hot-2][172.18.0.2:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='Y-gGI00eSYq-svFeZE53pg_logstash_version_mismatch', type='doc', seqNo=344105, primaryTerm=25, version=70986, autoGeneratedIdTimestamp=-1}]]; nested: IllegalArgumentException[Operation term is newer than the current term; current term[24], operation term[Index{id='Y-gGI00eSYq-svFeZE53pg_logstash_version_mismatch', type='doc', seqNo=344105, primaryTerm=25, version=70986, autoGeneratedIdTimestamp=-1}]]; ",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "2B8CjR23T5q6tKDLkuOdew",
"node_name" : "es-hot-2",
"transport_address" : "54.156.57.244:9300",
"node_attributes" : {
"xpack.installed" : "true"
},
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-12-14T21:25:31.671Z], failed_attempts[5], delayed=false, details[failed shard on node [2B8CjR23T5q6tKDLkuOdew]: failed recovery, failure RecoveryFailedException[[.watches][0]: Recovery failed from {es-hot-1}{xLsr0j9SREGAGLF6IRBR3A}{3V-zLIMATKaKNHpob95_bA}{54.156.46.144}{54.156.46.144:9300}{xpack.installed=true} into {es-hot-2}{2B8CjR23T5q6tKDLkuOdew}{3H5GvWeKQaugCB1wsq6r3A}{54.156.57.244}{54.156.57.244:9300}{xpack.installed=true}]; nested: RemoteTransportException[[es-hot-1][172.18.0.2:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[es-hot-2][172.18.0.2:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='Y-gGI00eSYq-svFeZE53pg_logstash_version_mismatch', type='doc', seqNo=344105, primaryTerm=25, version=70986, autoGeneratedIdTimestamp=-1}]]; nested: IllegalArgumentException[Operation term is newer than the current term; current term[24], operation term[Index{id='Y-gGI00eSYq-svFeZE53pg_logstash_version_mismatch', type='doc', seqNo=344105, primaryTerm=25, version=70986, autoGeneratedIdTimestamp=-1}]]; ], allocation_status[no_attempt]]]"
}
]
},
{
"node_id" : "xLsr0j9SREGAGLF6IRBR3A",
"node_name" : "es-hot-1",
"transport_address" : "54.156.46.144:9300",
"node_attributes" : {
"xpack.installed" : "true"
},
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-12-14T21:25:31.671Z], failed_attempts[5], delayed=false, details[failed shard on node [2B8CjR23T5q6tKDLkuOdew]: failed recovery, failure RecoveryFailedException[[.watches][0]: Recovery failed from {es-hot-1}{xLsr0j9SREGAGLF6IRBR3A}{3V-zLIMATKaKNHpob95_bA}{54.156.46.144}{54.156.46.144:9300}{xpack.installed=true} into {es-hot-2}{2B8CjR23T5q6tKDLkuOdew}{3H5GvWeKQaugCB1wsq6r3A}{54.156.57.244}{54.156.57.244:9300}{xpack.installed=true}]; nested: RemoteTransportException[[es-hot-1][172.18.0.2:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[es-hot-2][172.18.0.2:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='Y-gGI00eSYq-svFeZE53pg_logstash_version_mismatch', type='doc', seqNo=344105, primaryTerm=25, version=70986, autoGeneratedIdTimestamp=-1}]]; nested: IllegalArgumentException[Operation term is newer than the current term; current term[24], operation term[Index{id='Y-gGI00eSYq-svFeZE53pg_logstash_version_mismatch', type='doc', seqNo=344105, primaryTerm=25, version=70986, autoGeneratedIdTimestamp=-1}]]; ], allocation_status[no_attempt]]]"
},
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[.watches][0], node[xLsr0j9SREGAGLF6IRBR3A], [P], s[STARTED], a[id=00UdDhMlS2Wrw4WAFplKBQ]]"
}
]
}
]
}

DavidTurner · December 16, 2018, 9:07am

Something is wrong in your environment, because it looks like a stale shard copy has been elected as primary for this shard. Are you, for instance, using ephemeral storage for your master nodes? Do any of your logs say anything about dangling indices? If so, there is a risk that you have lost some data here.

This particular unassigned shard is a replica. You should first try and address any unassigned primaries, so I would ignore this shard until your cluster health is YELLOW. Use the output of GET _cat/shards to find a shard with no assigned primary and then use the more detailed allocation explain API to work out why it's unassigned:

GET /_cluster/allocation/explain
{
  "index": "myindex",
  "shard": 0,
  "primary": true
}
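As a rough illustration of the workflow above, here is a minimal Python sketch that scans the text output of GET _cat/shards?h=index,shard,prirep,state for unassigned primaries and builds the matching allocation explain request body for each. The function names and the sample listing are hypothetical and made up for illustration; the sketch only parses text and does not call the cluster.

```python
import json


def unassigned_primaries(cat_shards_text):
    """Return (index, shard) pairs for primary shards in state UNASSIGNED.

    Assumes the first four whitespace-separated columns are
    index, shard, prirep, state, as requested with
    GET _cat/shards?h=index,shard,prirep,state
    """
    found = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        index, shard, prirep, state = fields[:4]
        # "p" marks a primary, "r" a replica; a header row matches neither
        if prirep == "p" and state == "UNASSIGNED":
            found.append((index, int(shard)))
    return found


def explain_body(index, shard):
    """Build the JSON body for GET /_cluster/allocation/explain."""
    return json.dumps({"index": index, "shard": shard, "primary": True})


# Hypothetical listing, not taken from the cluster in this thread.
sample = """\
.watches 0 p STARTED
.watches 0 r UNASSIGNED
myindex 0 p UNASSIGNED
myindex 0 r UNASSIGNED
"""

for index, shard in unassigned_primaries(sample):
    print(explain_body(index, shard))
```

Each printed body can be sent with GET /_cluster/allocation/explain. Replicas such as the .watches one are deliberately skipped until every primary is assigned.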
