Hi All,
I am running an ELK 7.6.2 stack.
Recently the cluster status turned yellow and it started reporting one unassigned shard. Please see the cluster health output below:
{
  "cluster_name" : "elkcluster-prod",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 187,
  "active_shards" : 373,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.73262032085562
}
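To identify which shard is unassigned and why, a listing like the following helps (same placeholder host and credentials as the command below; `unassigned.reason` is a standard `_cat/shards` column):

```shell
# List all shards, keeping only the unassigned ones with their reason code.
curl -u test:abcd -XGET "http://<servername>:<port>/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  | grep UNASSIGNED
```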
Checking the cluster allocation explain API, I see the following (a snippet only, as the full output is long):
curl -u test:abcd -XGET "http://<servername>:<port>/_cluster/allocation/explain?pretty"
{
"index" : "prod_test_access-2020.09.02",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2020-09-02T09:27:29.313Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=piAWhp74QhGZ7O5BjZwxpA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:39.789Z], failed_attempts[4], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=NFp5yzjhSaWrd06501MSrQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:12.236Z], failed_attempts[3], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Rfdtjx3SRRCx66LJ-purBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:23:25.732Z], failed_attempts[2], failed_nodes[[Tnji584ORvqllqH61AjLcQ]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=3jUmAt62QqyD-j6deRhhTA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:13:58.784Z], failed_attempts[1], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], s[STARTED], a[id=hnugEI7_R9m5GuCj3ai2nA], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: 
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [989108536/943.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [988766312/942.9mb], new bytes reserved: [342224/334.2kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=346026/337.9kb, accounting=177727442/169.4mb]]; ], allocation_status[no_attempt]], expected_shard_size[26613176043], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [991023452/945.1mb], which is larger than the limit of [986061209/940.3mb], real usage: [990885000/944.9mb], new bytes reserved: [138452/135.2kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=138452/135.2kb, accounting=185692887/177mb]]; ], allocation_status[no_attempt]], expected_shard_size[20582000349], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: ShardNotFoundException[no such shard]; ], allocation_status[no_attempt]], expected_shard_size[23170768236], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [990878280/944.9mb], which is larger than the limit of [986061209/940.3mb], real usage: [990813568/944.9mb], new bytes reserved: [64712/63.1kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=1192778/1.1mb, accounting=186306611/177.6mb]]; ], allocation_status[no_attempt]], expected_shard_size[21041947489], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [986291206/940.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [986193512/940.5mb], new bytes reserved: 
[97694/95.4kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=1157196/1.1mb, accounting=177508929/169.2mb]]; ",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "2ULg0RFsSZiaUYm4SUpDOQ",
"node_name" : "xxx-xxx-xxx",
"transport_address" : "56.127.96.93:9300",
"node_attributes" : {
"ml.machine_memory" : "67378692096",
"ml.max_open_jobs" : "20",
"xpack.installed" : "true"
},
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:27:29.313Z], failed_attempts[5], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=piAWhp74QhGZ7O5BjZwxpA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:39.789Z], failed_attempts[4], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=NFp5yzjhSaWrd06501MSrQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:25:12.236Z], failed_attempts[3], failed_nodes[[Tnji584ORvqllqH61AjLcQ, f23KRLGNRQiN8pHojqf_vg]], delayed=false, details[failed shard on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Rfdtjx3SRRCx66LJ-purBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:23:25.732Z], failed_attempts[2], failed_nodes[[Tnji584ORvqllqH61AjLcQ]], delayed=false, details[failed shard on node [Tnji584ORvqllqH61AjLcQ]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[Tnji584ORvqllqH61AjLcQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=3jUmAt62QqyD-j6deRhhTA], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-02T09:13:58.784Z], failed_attempts[1], delayed=false, details[failed shard 
on node [f23KRLGNRQiN8pHojqf_vg]: failed to perform indices:data/write/bulk[s] on replica [prod_test_access-2020.09.02][0], node[f23KRLGNRQiN8pHojqf_vg], [R], s[STARTED], a[id=hnugEI7_R9m5GuCj3ai2nA], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [989108536/943.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [988766312/942.9mb], new bytes reserved: [342224/334.2kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=346026/337.9kb, accounting=177727442/169.4mb]]; ], allocation_status[no_attempt]], expected_shard_size[26613176043], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [991023452/945.1mb], which is larger than the limit of [986061209/940.3mb], real usage: [990885000/944.9mb], new bytes reserved: [138452/135.2kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=138452/135.2kb, accounting=185692887/177mb]]; ], allocation_status[no_attempt]], expected_shard_size[20582000349], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: ShardNotFoundException[no such shard]; ], allocation_status[no_attempt]], expected_shard_size[23170768236], failure RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [990878280/944.9mb], which is larger than the limit of [986061209/940.3mb], real usage: [990813568/944.9mb], new bytes reserved: [64712/63.1kb], usages [request=0/0b, fielddata=609/609b, in_flight_requests=1192778/1.1mb, accounting=186306611/177.6mb]]; ], allocation_status[no_attempt]], expected_shard_size[21041947489], failure 
RemoteTransportException[[xxx-xxx-xxx][123-345-567:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [986291206/940.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [986193512/940.5mb], new bytes reserved: [97694/95.4kb], usages [request=0/0b, fielddata=560/560b, in_flight_requests=1157196/1.1mb, accounting=177508929/169.2mb]]; ], allocation_status[no_attempt]]]"
}
]
},
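If I am reading the CircuitBreakingException figures above correctly, the parent breaker limit implies these data nodes run with a very small heap. A rough back-of-the-envelope check (assuming `indices.breaker.total.limit` is at its 7.x default of 95% of heap; I have only the numbers from the exception to go on):

```python
# Figures copied from the CircuitBreakingException in the output above.
limit_bytes = 986_061_209      # parent breaker limit, reported as 940.3mb
attempted_bytes = 989_108_536  # size the request would have reached (943.2mb)

# ASSUMPTION: the default parent breaker limit of 95% of heap applies,
# so the reported limit lets us back out the configured heap size.
implied_heap_gib = limit_bytes / 0.95 / 1024**3
overshoot_mib = (attempted_bytes - limit_bytes) / 1024**2

print(f"implied heap ~= {implied_heap_gib:.2f} GiB")
print(f"request exceeded the limit by {overshoot_mib:.1f} MiB")
```

That works out to roughly 1 GiB of heap per data node, so even modest bulk-indexing traffic would trip the parent breaker repeatedly, which matches the repeated replication failures above.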
Could you please advise how to bring the cluster health back to green and resolve the unassigned shard?
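For reference, the manual retry that the `max_retry` decider message points at would be invoked like this (same placeholder host and credentials as above); my understanding is that this only retries allocation and does not address the underlying circuit breaker pressure:

```shell
# Ask the master to retry shards that exceeded the failed-allocation limit.
curl -u test:abcd -XPOST "http://<servername>:<port>/_cluster/reroute?retry_failed=true&pretty"
```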
I also see a very high indexing rate in the monitoring console: