Elasticsearch cluster in Yellow state and 1 Unassigned Shard

Hi All,

I am running the ELK 7.6.2 stack.

Recently the cluster status went "Yellow" and it started showing one unassigned shard. Please see below:

{
  "cluster_name" : "elkcluster-prod",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 187,
  "active_shards" : 373,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.73262032085562
}
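
For context, the above is the output of the cluster health API; a command along the lines of the following (same host/credential placeholders as the curl below) returns it:

curl -u test:abcd -XGET http://<servername>:<port>/_cluster/health?pretty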

On checking the cluster allocation explanation I see the following (showing a snippet, as the output is long):

curl -u test:abcd -XGET http://<servername>:<port>/_cluster/allocation/explain?pretty

{
  "index" : "prod_test_access-2020.09.02",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-09-02T09:27:29.313Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=piAWhp74QhGZ7O5BjZwxpA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:39.789Z], failed_attempts[4], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=NFp5yzjhSaWrd06501MSrQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:12.236Z], failed_attempts[3], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Rfdtjx3SRRCx66LJ-purBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:23:25.732Z], failed_attempts[2], failed_nodes[[Tnji584ORvqllqH61AjLcQ]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=3jUmAt62QqyD-j6deRhhTA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:13:58.784Z], failed_attempts[1], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], s[STARTED], a[id=hnugEI7_R9m5GuCj3ai2nA], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [989108536/943.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [988766312/942.9mb], new bytes reserved: [342224/334.2kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=346026/337.9kb, accounting=177727442/169.4mb]]; ], allocation_status[no_attempt]], expected_shard_size[26613176043], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [991023452/945.1mb], which is larger than the limit of [986061209/940.3mb], real usage: [990885000/944.9mb], new bytes reserved: [138452/135.2kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=138452/135.2kb, accounting=185692887/177mb]]; ], allocation_status[no_attempt]], expected_shard_size[20582000349], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: ShardNotFoundException[no such shard]; ], allocation_status[no_attempt]], expected_shard_size[23170768236], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [990878280/944.9mb], which is larger than the limit of [986061209/940.3mb], real usage: [990813568/944.9mb], new bytes reserved: [64712/63.1kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=1192778/1.1mb, 
accounting=186306611/177.6mb]]; ], allocation_status[no_attempt]], expected_shard_size[21041947489], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [986291206/940.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [986193512/940.5mb], new bytes reserved: [97694/95.4kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=1157196/1.1mb, accounting=177508929/169.2mb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2ULg0RFsSZiaUYm4SUpDOQ",
      "node_name" : "xxx-xxx-xxx",
      "transport_address" : "56.127.96.93:9300",
      "node_attributes" : {
        "ml.machine_memory" : "67378692096",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:27:29.313Z], failed_attempts[5], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=piAWhp74QhGZ7O5BjZwxpA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:39.789Z], failed_attempts[4], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=NFp5yzjhSaWrd06501MSrQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:12.236Z], failed_attempts[3], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Rfdtjx3SRRCx66LJ-purBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:23:25.732Z], failed_attempts[2], failed_nodes[[Tnji584ORvqllqH61AjLcQ]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=3jUmAt62QqyD-j6deRhhTA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:13:58.784Z], failed_attempts[1], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], s[STARTED], a[id=hnugEI7_R9m5GuCj3ai2nA], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [989108536/943.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [988766312/942.9mb], new bytes reserved: [342224/334.2kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=346026/337.9kb, accounting=177727442/169.4mb]]; ], allocation_status[no_attempt]], expected_shard_size[26613176043], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [991023452/945.1mb], which is larger than the limit of [986061209/940.3mb], real usage: [990885000/944.9mb], new bytes reserved: [138452/135.2kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=138452/135.2kb, accounting=185692887/177mb]]; ], allocation_status[no_attempt]], expected_shard_size[20582000349], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: ShardNotFoundException[no such shard]; ], allocation_status[no_attempt]], expected_shard_size[23170768236], failure 
RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [990878280/944.9mb], which is larger than the limit of [986061209/940.3mb], real usage: [990813568/944.9mb], new bytes reserved: [64712/63.1kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=1192778/1.1mb, accounting=186306611/177.6mb]]; ], allocation_status[no_attempt]], expected_shard_size[21041947489], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [986291206/940.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [986193512/940.5mb], new bytes reserved: [97694/95.4kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=1157196/1.1mb, accounting=177508929/169.2mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    ...

Please help me bring the cluster health back to green and get rid of the unassigned shard.

I also see a very high indexing rate in the console.
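
The failure details above also mention a CircuitBreakingException ("[parent] Data too large"). If it helps with the diagnosis, I can pull the per-node breaker usage with something like:

GET _nodes/stats/breaker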


You can try an empty reroute request against the index to see if that will work. Otherwise, drop the replica and then re-add it.

try:

POST _cluster/reroute?retry_failed=true
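
For reference, an "empty" reroute is just the same endpoint with no commands in the body, along the lines of:

POST _cluster/reroute
{}

That only re-runs the allocation process; the ?retry_failed=true form above additionally retries shards that have hit the maximum number of failed allocation attempts.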

Thanks. Can you please let me know the command to execute an empty reroute request against an index? Also, how can I drop a replica and re-add it? I am new to ELK.

Please note that I cannot afford to remove or lose any data.

Thanks

I would run the one that @lzukel suggested above first.

Thanks. The following command was run:

POST _cluster/reroute?retry_failed=true

The output is very large and does not fit within the allowed content length of a post. The status stays "Yellow".

Can you please let me know the command to run a reroute against a single index, since I know which index is showing yellow?

What is the output from _cat/recovery/prod_test_access-2020.09.02?v?

Ran the following:

POST _cat/recovery/prod_test_access-2020.09.02?v

I am getting an unauthorized response. Do I need to include user:password in the command, since security is enabled?

You need to run a GET; if you do it in the Kibana Console you should be OK.
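
If you prefer curl, the same pattern as your earlier command should work, for example (placeholders as before):

curl -u test:abcd -XGET "http://<servername>:<port>/_cat/recovery/prod_test_access-2020.09.02?v"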

Thanks. Please see output below:

index                            shard time  type        stage source_host source_node target_host  target_node  repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
prod_test_access-2020.09.02 0     106ms empty_store done  n/a         n/a         hostnameA hostnameA n/a        n/a      0     0               0.0%          0           0     0               0.0%          0           0            0                      100.0%

OK, it'd be worth dropping the replica and then adding it back.

PUT prod_test_access-2020.09.02/_settings
{ "index" : { "number_of_replicas" : 0 } }

And then;

PUT prod_test_access-2020.09.02/_settings
{ "index" : { "number_of_replicas" : 1 } }

And that should do it!
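
While the new replica is copying from the primary, you can watch progress with the same recovery API as before, e.g.:

GET _cat/recovery/prod_test_access-2020.09.02?v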

Thanks. I ran both of those through the Console, and both times the output was:

{
  "acknowledged" : true
}

It still shows Yellow health and the unassigned shard. Was I supposed to wait longer after dropping the replica?


Allocation is not instantaneous. What is the output from the _cat/recovery endpoint from above?
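
If you would rather wait on it, something like this should block until the index goes green (the timeout value is just an example):

GET _cluster/health/prod_test_access-2020.09.02?wait_for_status=green&timeout=5m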

Please see below:

index                            shard time  type        stage source_host  source_node  target_host  target_node  repository snapshot files files_recovered files_percent files_total bytes       bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
prod_test_access-2020.09.02 0     11.1m peer        index hostnameA hostnameA hostnameB hostnameB n/a        n/a      229   209             91.3%         229         67745704669 23951038748     35.4%         67745704669 0            0                      100.0%
prod_test_access-2020.09.02 0     106ms empty_store done  n/a          n/a          hostnameA hostnameA n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%

Righto, so that indicates the recovery process is progressing.

Thank you so much for the help! It's 100% done and the cluster is green now.

So we dropped the replica and then rebuilt it, which took care of the unassigned shard.


No worries.

Oh it turned Yellow and shows 1 unassigned shard again!

And it went back to Green. I will give it some time.

Thanks for all the help

I checked today, and the same index (with today's date) shows Yellow status and an unassigned shard.

{
  "cluster_name" : "elkcluster-prod",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 191,
  "active_shards" : 381,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.73821989528795
}