ES 7.0.1 : Unassigned Shards : Clarifications on how reroute API with retry_failed parameter works and its side-effects

ES Version : 7.0.1

Cluster Details:
3 physical nodes, each running 4 instances of ES, for a total of 12 instances in the cluster

Problem description:
As part of testing, we performed a full cluster reboot and noticed that some shards (please see the cluster-health output) reached UNASSIGNED state; the cluster status became RED and remained in that state for a very long time (6-7 hours).

Cluster health as seen during the initial problem stage:
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 12,
  "active_primary_shards" : 814,
  "active_shards" : 1618,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 22,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.65853658536585
}

Hence, we attempted a reroute (POST "/_cluster/reroute?retry_failed=true") to recover the long-failing shards. After a couple of hours, during which some shards went through RELOCATING, the cluster status turned GREEN.
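
For reference, the call was along the following lines (the host and port are placeholders, not the actual endpoint):

    # Retry allocation of shards that failed more than the allowed number of attempts
    curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"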

The cluster was then left undisturbed in that stable state. After 10-12 hours, on checking the cluster health again, we found that some shards had randomly gone into UNASSIGNED state and the cluster status had turned RED. We examined whether any of the ES instances had been rebooted, but no such condition was observed.

When the explain API was run at this juncture, we observed the following cluster-health and explain API outputs.

Cluster Health:
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 12,
  "active_primary_shards" : 809,
  "active_shards" : 1605,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 25,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.46625766871165
}

Corresponding explain API output

A single sample snippet follows (all unassigned shards show permutations of the negative deciders below as the reason for the UNASSIGNED state):

{
  "node_id" : "wbr4ZGYJRnycM0u94urCNg",
  "node_name" : "elasticsearch-6",
  "transport_address" : "AA.BB.MM.YY:9300",
  "node_attributes" : {
    "ml.machine_memory" : "810179231744",
    "ml.max_open_jobs" : "20",
    "xpack.installed" : "true",
    "zone" : "node-0"
  },
  "node_decision" : "no",
  "deciders" : [
    {
      "decider" : "max_retry",
      "decision" : "NO",
      "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-02-16T08:47:24.068Z], failed_attempts[5], delayed=false, details[failed shard on node [v-B8CJFjTqW2E1iXwZPCaA]: failed recovery, failure RecoveryFailedException[[docs_0_1581724852890][0]: Recovery failed from {elasticsearch-6}{wbr4ZGYJRnycM0u94urCNg}{xThCwa-XTW2lrwyCvHA-NQ}{AA.BB.MM.YY}{AA.BB.MM.YY:9300}{ml.machine_memory=810179231744, ml.max_open_jobs=20, xpack.installed=true, zone=node-0} into {elasticsearch-3}{v-B8CJFjTqW2E1iXwZPCaA}{Q6Ru0k27SD-do7wCfrxSWQ}{XX.ZZ.WW.JJ}{XX.ZZ.WW.JJ:9300}{ml.machine_memory=810191155200, xpack.installed=true, zone=node-2, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elasticsearch-6][AA.BB.MM.YY:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [14240635080/13.2gb], which is larger than the limit of [14214522470/13.2gb], real usage: [14239573248/13.2gb], new bytes reserved: [1061832/1mb]]; ], allocation_status[no_attempt]]]"
    },
    {
      "decider" : "same_shard",
      "decision" : "NO",
      "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[docs_0_1581724852890][0], node[wbr4ZGYJRnycM0u94urCNg], [P], s[STARTED], a[id=q5O1_UW4Sw2FQW6NALb87A]]"
    },
    {
      "decider" : "throttling",
      "decision" : "THROTTLE",
      "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [wbr4ZGYJRnycM0u94urCNg] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
    },
    {
      "decider" : "awareness",
      "decision" : "NO",
      "explanation" : "there are too many copies of the shard allocated to nodes with attribute [zone], there are [2] total configured shard copies for this shard id and [5] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
    }
  ]
}
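
For completeness, the output above came from the allocation explain API, invoked roughly as follows (the index/shard parameters are illustrative, taken from the sample above; calling the API without a body explains the first unassigned shard it finds):

    # Ask why a particular shard copy is unassigned (parameters are illustrative)
    curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
    {
      "index": "docs_0_1581724852890",
      "shard": 0,
      "primary": false
    }'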

Clarifications:

  1. After the initial reroute for the failed shards succeeded and the cluster turned GREEN, are there any reasons why shards could again go into UNASSIGNED state on their own, without any change to the cluster?

  2. Out of the above deciders in the explain API output,

  • max_retry : This indicates the circuit-breaker exception - CircuitBreakingException[[parent] Data too large, data for [<transport_request>]..
    Xmx and Xms are 14G for each of the 12 ES instances and we have not moved to G1GC. Is there any additional data we should collect to decide whether each ES instance is starved for memory? (A sample node-stats query is sketched after this list.)

  • same_shard : This indicates ..the shard cannot be allocated to the same node on which a copy of the shard already exists.. What could be the reason for this after the initial reroute attempt was successful?

  • awareness : As per the explanation output above, is this related to the configuration below, which is used in elasticsearch.yml? If so, could you please explain how this configuration negates allocation?

    A snippet of elasticsearch.yml from ES instance elasticsearch-1 on physical node node-2 follows
    (here, node-0, node-1 and node-2 represent the physical servers, each running 4 instances of ES):

         node.name: elasticsearch-1
         cluster.initial_master_nodes: ["elasticsearch-0" , "elasticsearch-1" , "elasticsearch-2" , "elasticsearch-3" , "elasticsearch-4" , "elasticsearch-5" , "elasticsearch-6" , "elasticsearch-7" , "elasticsearch-8" , "elasticsearch-9" , "elasticsearch-10" , "elasticsearch-11"]
         cluster.routing.allocation.awareness.attributes: zone
         cluster.routing.allocation.awareness.force.zone.values: node-0,node-1,node-2
         node.attr.zone:  node-2
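
    In case it helps with the memory question above, the kind of per-node heap snapshot I can collect looks like this (a sketch; the filter_path parameter only trims the response):

         # Per-node heap usage at a glance
         curl -X GET "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.mem.heap_max_in_bytes&pretty"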
    

Thanks in advance
Muthu

Yes, the details of shard/node failures are logged by default. Check your server logs, particularly on the elected master node and the affected data node.

Your heap was ~95% full (13.2GB of 14GB used) so the recovery was stopped to protect the node from going completely out-of-memory and crashing. This happened repeatedly and Elasticsearch stops retrying after 5 attempts.
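
If you want to see how often the breakers are tripping, the breaker section of the node stats gives a quick view (a sketch; the host is a placeholder):

    # Per-node circuit-breaker stats, including the parent breaker and its trip count
    curl -X GET "localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent&pretty"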


Thanks for the quick reply @DavidTurner.

We will check the logs of the master and data nodes.

What could be the reason for the same_shard and awareness deciders kicking in and denying allocation for some of the shards?

Elasticsearch won't allocate this shard to node elasticsearch-6 since that node already has a copy of this shard. That sounds correct to me. I expect there will be at least one node on which these deciders are not blocking allocation.

@DavidTurner, we tried bumping the heap from 14GB to 20GB, but the limit appears to move along with it. From the logs, I see the following:

[2020-02-20T08:52:30,985][WARN ][o.e.a.s.TransportClearScrollAction] [elasticsearch-0] Clear SC failed on node[{elasticsearch-2}{35VghebSSsasUjGCN2Zs9A}{Qk77To2xQKCsfzSe-iVzMg}{10.60.0.211}{10.60.0.211:9300}{ml.machine_memory=810191155200, ml.max_open_jobs=20, xpack.installed=true, zone=node-2}
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-2][10.60.0.211:9300][indices:data/read/search[free_context/scroll]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [22257583276/20.7gb], which is larger than the limit of [21848994611/20.3gb], real usage: [22257583120/20.7gb], new bytes reserved: [156/156b]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1026) [elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:922) [elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:753) [elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) [transport-netty4-client-7.0.1.jar:7.0.1]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:835) [?:?] 
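
As I understand it (and please correct me if I am wrong), the parent breaker limit defaults to a percentage of the heap (95% with the real-memory breaker in 7.x), which would explain why raising Xmx also raises the limit. The effective value can be checked along these lines:

    # Show the configured/default parent breaker settings (grep only narrows the output)
    curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep "indices.breaker"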

Since we did not want to introduce too many variables, I did not touch any other settings (e.g. cluster.routing.allocation.node_concurrent_recoveries), because I suspect that increasing concurrent recoveries without fixing the CircuitBreakingException would only aggravate the memory requirements, and this way I retain a baseline of heap + non-heap usage for the scale being handled.

I had a few clarifications in this regard:

  1. I am not sure whether the GC should be switched from CMS to G1GC: would G1GC provide any better way(s) to avoid the CircuitBreakingException? I happened to see another long thread which did not conclude whether G1GC is a concrete workaround -

(Moving to higher JDK versions is certainly not an option for me, at least in the near future; we are on OpenJDK 1.8.0.)

  2. Is there any specific data we should collect from node-stats (the output dump was too big, hence not pasted here)? Any specific sections / snippets to examine?

Hi @Muthukumaran_Kothand,

I think there are a few details to look into. The first thing that springs to mind is the fact that each physical node runs 4 ES instances:

That would require at least configuring the allocated processors using the processors setting, as well as the JDK GC thread settings, since otherwise all 4 ES nodes on a machine think they own the full machine (AFAIR, JDK 1.8 does not auto-configure relative to cgroups settings). ES and the JVM will otherwise use far more threads than the physical node can handle, which increases the risk of hitting the circuit breaker. Preferably you should also set up docker/cgroups to limit the interference between the nodes (CPU and memory).
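
For illustration only (the numbers are placeholders; assume for example a 32-core machine shared by 4 instances), that could look like:

    # elasticsearch.yml on each instance: claim only a share of the machine's cores
    processors: 8

    # jvm.options: cap GC thread counts to the same share (placeholder values)
    -XX:ParallelGCThreads=8
    -XX:ConcGCThreads=2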

Checking that swapping is disabled and that you have plenty of RAM available for 4 * heap + overhead + file caching are also good checks to do.
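
A quick way to verify from the cluster side that each node has locked its heap in memory (a sketch; swappiness itself would be checked at the OS level):

    # true for each node when bootstrap.memory_lock is effective
    curl -X GET "localhost:9200/_nodes?filter_path=**.mlockall&pretty"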

The next thing I would double check is that the node is really running with CMS and has not for some reason fallen back to the JDK 8 default of Parallel GC (which will provoke the circuit breaker frequently).
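
One way to confirm which collector each node is actually using (a sketch; with CMS you should see ParNew / ConcurrentMarkSweep rather than PS Scavenge / PS MarkSweep):

    # Report the GC collectors and JVM flags each node started with
    curl -X GET "localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc_collectors,nodes.*.jvm.input_arguments&pretty"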

There is no reason to go from CMS to G1. In particular we only recommend using G1 from JDK 10 and onwards. The discuss issues linked above on G1 mainly focused on an issue when using G1, not when using CMS.

If the above is not enough, I would advise looking into the GC logs and the workload when this happens (high concurrent load, or maybe a single search causing issues). GC logging should be enabled by default, and you can use the _tasks and _nodes/hot_threads APIs to investigate the workload further.
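
Those workload checks could look something like this (a sketch):

    # Tasks currently executing on the cluster, with per-task detail
    curl -X GET "localhost:9200/_tasks?detailed=true&pretty"

    # The busiest threads on each node at this moment
    curl -X GET "localhost:9200/_nodes/hot_threads?threads=5"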
