ES Version : 7.0.1
Cluster Details:
3 physical nodes, each running 4 instances of ES - 12 instances in the cluster in total.
Problem description:
As part of testing, we performed a full cluster reboot and noticed that some shards (please see the cluster-health output below) went into UNASSIGNED state; the cluster status became RED and remained in that state for a very long time (6-7 hours).
Cluster health as seen during the initial problem stage:
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 12,
"number_of_data_nodes" : 12,
"active_primary_shards" : 814,
"active_shards" : 1618,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 22,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.65853658536585
}
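For reference, the cluster health snapshots in this post were captured with the cluster health API:

GET _cluster/health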
Hence, we attempted a reroute (POST "/_cluster/reroute?retry_failed=true") to recover the long-failing shards. After a couple of hours with some shards RELOCATING, the cluster status turned GREEN, and the cluster was left undisturbed in that stable state. After 10-12 hrs, on checking the cluster health again, some shards had randomly gone back into UNASSIGNED state and the cluster status turned RED. We examined whether any ES instance had been rebooted in the meantime and observed no such condition.
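One way to list the unassigned shard copies and their routing-level reason, shown here for completeness (the column selection is just one convenient combination):

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state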
When the explain API was run at this juncture, we noticed the following cluster-health and explain API outputs.
Cluster Health:
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 12,
"number_of_data_nodes" : 12,
"active_primary_shards" : 809,
"active_shards" : 1605,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 25,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.46625766871165
}
Corresponding explain API output:
A single sample snippet follows; all unassigned shards show permutations of the negative deciders below as the reason for their UNASSIGNED state:
{
"node_id" : "wbr4ZGYJRnycM0u94urCNg",
"node_name" : "elasticsearch-6",
"transport_address" : "AA.BB.MM.YY:9300",
"node_attributes" : {
"ml.machine_memory" : "810179231744",
"ml.max_open_jobs" : "20",
"xpack.installed" : "true",
"zone" : "node-0"
},
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-02-16T08:47:24.068Z], failed_attempts[5], delayed=false, details[failed shard on node [v-B8CJFjTqW2E1iXwZPCaA]: failed recovery, failure RecoveryFailedException[[docs_0_1581724852890][0]: Recovery failed from {elasticsearch-6}{wbr4ZGYJRnycM0u94urCNg}{xThCwa-XTW2lrwyCvHA-NQ}{AA.BB.MM.YY}{AA.BB.MM.YY:9300}{ml.machine_memory=810179231744, ml.max_open_jobs=20, xpack.installed=true, zone=node-0} into {elasticsearch-3}{v-B8CJFjTqW2E1iXwZPCaA}{Q6Ru0k27SD-do7wCfrxSWQ}{XX.ZZ.WW.JJ}{XX.ZZ.WW.JJ:9300}{ml.machine_memory=810191155200, xpack.installed=true, zone=node-2, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elasticsearch-6][AA.BB.MM.YY:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [14240635080/13.2gb], which is larger than the limit of [14214522470/13.2gb], real usage: [14239573248/13.2gb], new bytes reserved: [1061832/1mb]]; ], allocation_status[no_attempt]]]"
},
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[docs_0_1581724852890][0], node[wbr4ZGYJRnycM0u94urCNg], [P], s[STARTED], a[id=q5O1_UW4Sw2FQW6NALb87A]]"
},
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of outgoing shard recoveries [2] on the node [wbr4ZGYJRnycM0u94urCNg] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
},
{
"decider" : "awareness",
"decision" : "NO",
"explanation" : "there are too many copies of the shard allocated to nodes with attribute [zone], there are [2] total configured shard copies for this shard id and [5] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
}
]
}
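For completeness, the allocation explanations were pulled with the allocation explain API; called with no body it reports an arbitrary unassigned shard, or a specific copy can be named. The index/shard below are simply taken from the sample output above, and primary=false is an assumption (the primary of that shard appears STARTED on elasticsearch-6):

GET _cluster/allocation/explain
{
  "index" : "docs_0_1581724852890",
  "shard" : 0,
  "primary" : false
}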
Clarifications:
- After the successful initial reroute execution for the failed shards and the cluster turning GREEN, would there be any reason why shards could again go into UNASSIGNED state autonomously, without any change in the cluster?
- Out of the above deciders of the explain API:
  - max_retry: This indicates the CB exception - CircuitBreakingException[[parent] Data too large, data for [<transport_request>].. Xmx and Xms for ES are 14G for each of the 12 instances and we have not moved to G1GC. Is there any additional data we should collect to decide whether there is memory starvation on each ES instance? (A sketch of the stats we could pull is included after the elasticsearch.yml snippet below.)
  - same_shard: This indicates ..the shard cannot be allocated to the same node on which a copy of the shard already exists.. What could be the reason for this after the initial reroute attempt was successful?
  - awareness: As per the above explanation output, is this related to the configuration below, which is used in elasticsearch.yml? If so, could you please explain how this configuration negates allocation?

A snippet of elasticsearch.yml from ES instance elasticsearch-1 on physical node node-2 is as follows:
(Here, node-0, node-1 and node-2 represent the physical servers, each running 4 instances of ES.)

node.name: elasticsearch-1
cluster.initial_master_nodes: ["elasticsearch-0" , "elasticsearch-1" , "elasticsearch-2" , "elasticsearch-3" , "elasticsearch-4" , "elasticsearch-5" , "elasticsearch-6" , "elasticsearch-7" , "elasticsearch-8" , "elasticsearch-9" , "elasticsearch-10" , "elasticsearch-11"]
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: node-0,node-1,node-2
node.attr.zone: node-2
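Regarding the max_retry / circuit-breaker question above, this is the additional node-level data we can collect and share if useful: per-instance parent breaker usage and JVM heap/GC figures. The filter_path values are only an assumption about how to trim the output and can be dropped to get the full stats:

GET _nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent
GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc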
Thanks in advance
Muthu