ES Version: 7.0.1
Cluster Details:
3 physical nodes, each running 4 instances of ES, for a total of 12 instances in the cluster
Problem description:
As part of testing, we performed a full cluster reboot and noticed that some shards (please see the cluster-health output) reached the UNASSIGNED state; the cluster status became RED and remained in that state for a very long time (6-7 hours).
Cluster health as seen during the initial problem stage:
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 12,
  "active_primary_shards" : 814,
  "active_shards" : 1618,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 22,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.65853658536585
}
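The health outputs in this post were captured with a standard cluster-health call of this form (host and port are placeholders for one of our instances):

curl -X GET "localhost:9200/_cluster/health?pretty"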
Hence, a reroute was attempted (POST /_cluster/reroute?retry_failed=true) to recover the long-failing shards. After a couple of hours of RELOCATING for some shards, the cluster status turned GREEN, and the cluster was left undisturbed in that stable state. When checking cluster health again 10-12 hours later, some shards had randomly gone into the UNASSIGNED state and the cluster status had turned RED. We examined whether any ES instances had rebooted, and no such condition was observed.
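For reference, the reroute was issued in this form (host and port are placeholders):

curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"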
When the allocation explain API was run at this juncture, the following cluster-health and explain API outputs were observed.
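The explain request was of this general form (host and port are placeholders; the body targets one of the unassigned shard copies of the index seen in the sample below, and can be omitted to explain the first unassigned shard found):

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "docs_0_1581724852890",
  "shard": 0,
  "primary": false
}'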
Cluster Health:
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 12,
  "active_primary_shards" : 809,
  "active_shards" : 1605,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 25,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.46625766871165
}
Corresponding explain API output. A single sample snippet follows (all unassigned shards show permutations of the negative deciders below as the reason for their UNASSIGNED state):
{
  "node_id" : "wbr4ZGYJRnycM0u94urCNg",
  "node_name" : "elasticsearch-6",
  "transport_address" : "AA.BB.MM.YY:9300",
  "node_attributes" : {
    "ml.machine_memory" : "810179231744",
    "ml.max_open_jobs" : "20",
    "xpack.installed" : "true",
    "zone" : "node-0"
  },
  "node_decision" : "no",
  "deciders" : [
    {
      "decider" : "max_retry",
      "decision" : "NO",
      "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-02-16T08:47:24.068Z], failed_attempts[5], delayed=false, details[failed shard on node [v-B8CJFjTqW2E1iXwZPCaA]: failed recovery, failure RecoveryFailedException[[docs_0_1581724852890][0]: Recovery failed from {elasticsearch-6}{wbr4ZGYJRnycM0u94urCNg}{xThCwa-XTW2lrwyCvHA-NQ}{AA.BB.MM.YY}{AA.BB.MM.YY:9300}{ml.machine_memory=810179231744, ml.max_open_jobs=20, xpack.installed=true, zone=node-0} into {elasticsearch-3}{v-B8CJFjTqW2E1iXwZPCaA}{Q6Ru0k27SD-do7wCfrxSWQ}{XX.ZZ.WW.JJ}{XX.ZZ.WW.JJ:9300}{ml.machine_memory=810191155200, xpack.installed=true, zone=node-2, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elasticsearch-6][AA.BB.MM.YY:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [14240635080/13.2gb], which is larger than the limit of [14214522470/13.2gb], real usage: [14239573248/13.2gb], new bytes reserved: [1061832/1mb]]; ], allocation_status[no_attempt]]]"
    },
    {
      "decider" : "same_shard",
      "decision" : "NO",
      "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[docs_0_1581724852890][0], node[wbr4ZGYJRnycM0u94urCNg], [P], s[STARTED], a[id=q5O1_UW4Sw2FQW6NALb87A]]"
    },
    {
      "decider" : "throttling",
      "decision" : "THROTTLE",
      "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [wbr4ZGYJRnycM0u94urCNg] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
    },
    {
      "decider" : "awareness",
      "decision" : "NO",
      "explanation" : "there are too many copies of the shard allocated to nodes with attribute [zone], there are [2] total configured shard copies for this shard id and [5] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
    }
  ]
}
Clarifications:
- After the successful initial reroute execution for the failed shards and the cluster turning GREEN, what reasons could there be for shards to autonomously go into the UNASSIGNED state again, without any change in the cluster?
- Regarding the deciders in the explain API output above:
  - max_retry: This indicates the circuit-breaker exception: CircuitBreakingException[[parent] Data too large, data for [<transport_request>].. Xmx and Xms for ES are 14G for each of the 12 instances, and we have not moved to G1GC. Is there any additional data we should collect to decide whether each ES instance is starved for memory? (See the stats sketch after this list.)
  - same_shard: This indicates "the shard cannot be allocated to the same node on which a copy of the shard already exists". What could be the reason for this after the initial reroute attempt was successful?
  - awareness: As per the explanation output above, is this related to the configuration below, which is used in elasticsearch.yml? If so, could you please explain how this configuration negates allocation? A snippet of elasticsearch.yml from ES instance elasticsearch-1 on physical node node-2 follows (here, node-0, node-1 and node-2 represent the physical servers, each running 4 ES instances):

node.name: elasticsearch-1
cluster.initial_master_nodes: ["elasticsearch-0", "elasticsearch-1", "elasticsearch-2", "elasticsearch-3", "elasticsearch-4", "elasticsearch-5", "elasticsearch-6", "elasticsearch-7", "elasticsearch-8", "elasticsearch-9", "elasticsearch-10", "elasticsearch-11"]
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: node-0,node-1,node-2
node.attr.zone: node-2
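As a starting point for the memory question above (max_retry), here is a minimal sketch of the per-node stats we could collect; host and port are placeholders for any one of our instances. Please advise if anything else is needed:

# JVM heap usage plus circuit-breaker limits and current estimates for every node
curl -X GET "localhost:9200/_nodes/stats/jvm,breaker?pretty"

# Quick per-node heap overview
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max"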
Thanks in advance
Muthu