Elasticsearch cluster in Yellow state and 1 Unassigned Shard

zaeemmasood · September 2, 2020, 9:26pm

Hi All,

I am running ELK 7.6.2 stack.

Recently the cluster state went "Yellow" and it started showing one Unassigned Shard. Please see below:

{
  "cluster_name" : "elkcluster-prod",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 187,
  "active_shards" : 373,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.73262032085562
}
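
As a quick sanity check on the health output above, the percentage can be reproduced from the shard counts. This is a minimal sketch that assumes the figure is active shards divided by active plus initializing plus unassigned shards; the function name is made up for illustration.

```python
def active_shards_percent(health):
    """Share of shards that are active, as a percentage.

    Assumption: total = active + initializing + unassigned.
    """
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return 100.0 * health["active_shards"] / total if total else 100.0


# Counts copied from the cluster health output in this post.
health = {"active_shards": 373, "initializing_shards": 0, "unassigned_shards": 1}
percent = active_shards_percent(health)
print(f"{percent:.4f}% active")
```

With one shard out of 374 unassigned, the result lines up with the reported value, so the yellow status here comes from that single replica.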

On checking cluster allocation I see the following (showing a snippet as output is long):

curl -u test:abcd -XGET "http://<servername>:<port>/_cluster/allocation/explain?pretty"

{
  "index" : "prod_test_access-2020.09.02",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-09-02T09:27:29.313Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=piAWhp74QhGZ7O5BjZwxpA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:39.789Z], failed_attempts[4], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=NFp5yzjhSaWrd06501MSrQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:12.236Z], failed_attempts[3], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Rfdtjx3SRRCx66LJ-purBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:23:25.732Z], failed_attempts[2], failed_nodes[[Tnji584ORvqllqH61AjLcQ]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=3jUmAt62QqyD-j6deRhhTA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:13:58.784Z], failed_attempts[1], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], s[STARTED], a[id=hnugEI7_R9m5GuCj3ai2nA], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: 
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [989108536/943.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [988766312/942.9mb], new bytes reserved: [342224/334.2kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=346026/337.9kb, accounting=177727442/169.4mb]]; ], allocation_status[no_attempt]], expected_shard_size[26613176043], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [991023452/945.1mb], which is larger than the limit of [986061209/940.3mb], real usage: [990885000/944.9mb], new bytes reserved: [138452/135.2kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=138452/135.2kb, accounting=185692887/177mb]]; ], allocation_status[no_attempt]], expected_shard_size[20582000349], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: ShardNotFoundException[no such shard]; ], allocation_status[no_attempt]], expected_shard_size[23170768236], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [990878280/944.9mb], which is larger than the limit of [986061209/940.3mb], real usage: [990813568/944.9mb], new bytes reserved: [64712/63.1kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=1192778/1.1mb, accounting=186306611/177.6mb]]; ], allocation_status[no_attempt]], expected_shard_size[21041947489], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [986291206/940.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [986193512/940.5mb], new bytes reserved: 
[97694/95.4kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=1157196/1.1mb, accounting=177508929/169.2mb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2ULg0RFsSZiaUYm4SUpDOQ",
      "node_name" : "xxx-xxx-xxx",
      "transport_address" : "56.127.96.93:9300",
      "node_attributes" : {
        "ml.machine_memory" : "67378692096",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:27:29.313Z], failed_attempts[5], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=piAWhp74QhGZ7O5BjZwxpA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:39.789Z], failed_attempts[4], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=NFp5yzjhSaWrd06501MSrQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:12.236Z], failed_attempts[3], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Rfdtjx3SRRCx66LJ-purBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:23:25.732Z], failed_attempts[2], failed_nodes[[Tnji584ORvqllqH61AjLcQ]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=3jUmAt62QqyD-j6deRhhTA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:13:58.784Z], failed_attempts[1], delayed=false, 
details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], s[STARTED], a[id=hnugEI7_R9m5GuCj3ai2nA], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [989108536/943.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [988766312/942.9mb], new bytes reserved: [342224/334.2kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=346026/337.9kb, accounting=177727442/169.4mb]]; ], allocation_status[no_attempt]], expected_shard_size[26613176043], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [991023452/945.1mb], which is larger than the limit of [986061209/940.3mb], real usage: [990885000/944.9mb], new bytes reserved: [138452/135.2kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=138452/135.2kb, accounting=185692887/177mb]]; ], allocation_status[no_attempt]], expected_shard_size[20582000349], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: ShardNotFoundException[no such shard]; ], allocation_status[no_attempt]], expected_shard_size[23170768236], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [990878280/944.9mb], which is larger than the limit of [986061209/940.3mb], real usage: [990813568/944.9mb], new bytes reserved: [64712/63.1kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=1192778/1.1mb, accounting=186306611/177.6mb]]; ], allocation_status[no_attempt]], expected_shard_size[21041947489], 
failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [986291206/940.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [986193512/940.5mb], new bytes reserved: [97694/95.4kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=1157196/1.1mb, accounting=177508929/169.2mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
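
The nested CircuitBreakingException messages carry the numbers needed to see how close the target node is to its limit. The sketch below parses one of the messages quoted above. The helper name is hypothetical, and the heap estimate assumes the parent breaker limit is at its 7.x default of 95% of the JVM heap, which may not hold if `indices.breaker.total.limit` was changed on this cluster.

```python
import re

# One of the CircuitBreakingException messages from the explain output above.
MESSAGE = (
    "[parent] Data too large, data for [<transport_request>] would be "
    "[989108536/943.2mb], which is larger than the limit of "
    "[986061209/940.3mb], real usage: [988766312/942.9mb], "
    "new bytes reserved: [342224/334.2kb]"
)

PATTERN = re.compile(
    r"would be \[(\d+)/[^\]]+\].*?"
    r"limit of \[(\d+)/[^\]]+\].*?"
    r"real usage: \[(\d+)/[^\]]+\].*?"
    r"new bytes reserved: \[(\d+)/[^\]]+\]"
)


def parse_breaker(message):
    """Pull the raw byte counts out of a parent circuit breaker message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not a parent circuit breaker message")
    would_be, limit, real_usage, reserved = (int(g) for g in match.groups())
    return {
        "would_be": would_be,
        "limit": limit,
        "real_usage": real_usage,
        "reserved": reserved,
    }


info = parse_breaker(MESSAGE)
overshoot = info["would_be"] - info["limit"]
# Assumption: the limit is 95% of the heap (the 7.x default).
implied_heap_mib = info["limit"] / 0.95 / 2**20
print(f"over the limit by {overshoot} bytes")
print(f"implied heap of roughly {implied_heap_mib:.0f} MiB")
```

If that assumption holds, the node rejecting the bulk writes is running with a heap of about 1 GB and was already at the limit before the request arrived, which would be worth checking with `GET _nodes/stats/jvm,breaker`.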

Please help on how to bring the health back to green and get rid of the unassigned shard.

I also see a very high index rate in the console.

warkolm · September 2, 2020, 9:44pm

You can try an empty reroute request against the index to see if that will work. Otherwise, drop the replica and then re-add it.

lzukel · September 2, 2020, 9:50pm

try:

POST _cluster/reroute?retry_failed=true

zaeemmasood · September 2, 2020, 10:35pm

Thanks. Can you please let me know the command to execute an empty reroute request against an index? Also, how can I drop a replica and re-add it? I am new to ELK.

Please note that I cannot afford to remove or lose any data.

Thanks

warkolm · September 2, 2020, 10:41pm

I would run the one that @lzukel suggested above first.

zaeemmasood · September 2, 2020, 11:16pm

Thanks. The following command was run:

POST _cluster/reroute?retry_failed=true

The output is very large and does not fit within the allowed content length of a post. The status stays "Yellow".
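
For anyone running the retry outside Kibana Console, here is a minimal sketch of the same request built with Python's standard library. The base URL and the credentials are placeholders. Adding the common `filter_path` option is a suggestion for trimming the response down to the `acknowledged` field and is not something proposed in this thread.

```python
import base64
import urllib.parse
import urllib.request


def build_retry_failed_request(base_url, user, password):
    """Build (but do not send) a POST to _cluster/reroute?retry_failed=true."""
    query = urllib.parse.urlencode(
        {"retry_failed": "true", "filter_path": "acknowledged"}
    )
    url = f"{base_url.rstrip('/')}/_cluster/reroute?{query}"
    # Basic auth header, needed when security is enabled.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Basic {token}"}
    )


# Placeholder host and credentials; replace with real values.
request = build_retry_failed_request("http://localhost:9200", "test", "abcd")
print(request.get_method(), request.full_url)
# To actually send it:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

Note that `retry_failed` only resets the retry counter. If the node is still tripping its circuit breaker, the allocation can fail again.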

Can you please let me know the command to run reroute against a single index as I know the index name that shows yellow?

warkolm · September 2, 2020, 11:21pm

What is the output from _cat/recovery/prod_test_access-2020.09.02?v?

zaeemmasood · September 2, 2020, 11:31pm

Ran the following:

POST _cat/recovery/prod_test_access-2020.09.02?v

Getting an unauthorized response. Do I need to pass user:password in the command, since security is enabled?

warkolm · September 2, 2020, 11:36pm

You need to run a GET; if you do it in Kibana Console you should be OK.

zaeemmasood · September 2, 2020, 11:41pm

Thanks. Please see output below:

index                       shard time  type        stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
prod_test_access-2020.09.02 0     106ms empty_store done  n/a         n/a         hostnameA   hostnameA   n/a        n/a      0     0               0.0%          0           0     0               0.0%          0           0            0                      100.0%

warkolm · September 3, 2020, 12:36am

OK, it'd be worth dropping the replica and then adding it back.

PUT prod_test_access-2020.09.02/_settings
{ "index" : { "number_of_replicas" : 0 } }

And then:

PUT prod_test_access-2020.09.02/_settings
{ "index" : { "number_of_replicas" : 1 } }

And that should do it!

zaeemmasood · September 3, 2020, 1:26am

Thanks. I attempted that through the console and both times the output was:

{
  "acknowledged" : true
}

Still it shows Yellow as health and the unassigned shard. I hope I was not supposed to wait longer after dropping the replica?

warkolm · September 3, 2020, 1:28am

Allocation is not instantaneous. What is the output from the _cat/recovery endpoint from above?
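
Because allocation takes time, it can help to poll the cluster status instead of checking once. The helper below is a sketch with a hypothetical name. The `fetch_status` callable is assumed to return the `status` field from `GET _cluster/health`, and a fake one stands in for it here so the example runs without a cluster.

```python
import time


def wait_for_green(fetch_status, timeout_s=1800.0, interval_s=10.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns "green" or the timeout expires.

    Returns True if the cluster went green, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == "green":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in for a real health call: yellow twice, then green.
fake_statuses = iter(["yellow", "yellow", "green"])
went_green = wait_for_green(lambda: next(fake_statuses), sleep=lambda s: None)
print("green" if went_green else "timed out")
```

The health API can also block on the server side via `GET _cluster/health?wait_for_status=green&timeout=30s`, which avoids a client-side loop.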

zaeemmasood · September 3, 2020, 1:31am

Please see below:

index                       shard time  type        stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes       bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
prod_test_access-2020.09.02 0     11.1m peer        index hostnameA   hostnameA   hostnameB   hostnameB   n/a        n/a      229   209             91.3%         229         67745704669 23951038748     35.4%         67745704669 0            0                      100.0%
prod_test_access-2020.09.02 0     106ms empty_store done  n/a         n/a         hostnameA   hostnameA   n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
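
The wide `_cat/recovery` output is easier to read once it is split into named fields. This sketch parses the header and the peer-recovery row from the output above. It relies on no value containing spaces, which holds for this output but is an assumption in general.

```python
# Header and in-progress row copied from the _cat/recovery output above.
CAT_OUTPUT = """\
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
prod_test_access-2020.09.02 0 11.1m peer index hostnameA hostnameA hostnameB hostnameB n/a n/a 229 209 91.3% 229 67745704669 23951038748 35.4% 67745704669 0 0 100.0%
"""


def parse_cat_table(text):
    """Turn whitespace-separated _cat output into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


rows = parse_cat_table(CAT_OUTPUT)
# Keep only recoveries that have not finished yet.
in_progress = [row for row in rows if row["stage"] != "done"]
row = in_progress[0]
remaining = int(row["bytes_total"]) - int(row["bytes_recovered"])
percent = 100.0 * int(row["bytes_recovered"]) / int(row["bytes_total"])
print(f"{row['source_host']} -> {row['target_host']}: "
      f"{percent:.1f}% copied, {remaining} bytes to go")
```

Requesting `GET _cat/recovery/<index>?format=json` returns the same fields as JSON and avoids parsing text at all.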

warkolm · September 3, 2020, 1:32am

Righto, so that indicates that the recovery process is progressing.

zaeemmasood · September 3, 2020, 1:55am

Thank you so much for the help! It's 100% done and the cluster is green now.

So we dropped the replica and then rebuilt it which took care of the unassigned Shard.

warkolm · September 3, 2020, 1:55am

No worries.

zaeemmasood · September 3, 2020, 1:58am

Oh it turned Yellow and shows 1 unassigned shard again!

zaeemmasood · September 3, 2020, 2:01am

And it went back to Green. I will give it some time.

Thanks for all the help

zaeemmasood · September 3, 2020, 1:15pm

Checked today and the same index for today's date shows Yellow status and an unassigned shard.

{
  "cluster_name" : "elkcluster-prod",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 191,
  "active_shards" : 381,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.73821989528795
}
