Questions about index.allocation.max_retries?

Yesterday our cluster turned yellow because a shard of one index failed to allocate, and a colleague found the following exception in the logs:

[2021-03-16T17:34:24,934][WARN ][o.e.a.b.TransportShardBulkAction] [es-data-5] [[user_crowd_relation_2020.12.11][12]] failed to perform indices:data/write/bulk[s] on replica [user_crowd_relation_2020.12.11][12], node[jBofE-wzTZiQZcfVa70P6w], [R], s[STARTED], a[id=Y-XPYf_gSeO5Jae30tVD8Q]
org.elasticsearch.transport.RemoteTransportException: [es-data-7][172.18.141.41:18008][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [20411577964/19gb], which is larger than the limit of [20401094656/19gb], real usage: [20411225480/19gb], new bytes reserved: [352484/344.2kb], usages [request=0/0b, fielddata=10613/10.3kb, in_flight_requests=352484/344.2kb, accounting=825619336/787.3mb]
    at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:170) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:118) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
    at java.lang.Thread.run(Thread.java:830) [?:?]
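
From the exception it looks like the parent circuit breaker on es-data-7 tripped while receiving a replica bulk request: the real heap usage was already above the configured limit. For reference, breaker usage and limits can be checked per node with something like the following (a sketch assuming the cluster is reachable on localhost:9200; adjust host, port and authentication for your setup):

# Per-node circuit breaker usage and limits
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'

# The parent limit that was exceeded above is controlled by indices.breaker.total.limit
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep -i breaker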

He suspected that shard allocation had hit the maximum number of retries because of this resource problem, so he changed index.allocation.max_retries from 60 to 109. According to him (a colleague had tested this before), the retry counter keeps accumulating over the life of the index and is not reset after a successful allocation, so the limit has to be raised from time to time. But I don't think that should be the case, and I couldn't find anything about this in the documentation. So my questions are: does this counter really keep accumulating, and what was the real underlying reason for the cluster turning yellow? Since everything was fine after index.allocation.max_retries was raised, I didn't have time to run an allocation explain at that point.
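
For completeness, the setting can be inspected and changed roughly like this (a sketch, assuming the cluster is reachable on localhost:9200; 109 is simply the value my colleague chose):

# Effective value of index.allocation.max_retries for the index (the built-in default is 5)
curl -s 'http://localhost:9200/user_crowd_relation_2020.12.11/_settings?include_defaults=true&flat_settings=true&pretty' | grep max_retries

# Raise the limit on the index
curl -s -X PUT 'http://localhost:9200/user_crowd_relation_2020.12.11/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.allocation.max_retries": 109}'

# Alternatively, retry allocations that already hit the retry limit without changing the setting
curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'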

The Elasticsearch version is 7.4.2.

The index in question holds about 262 GB of data in 18 primary shards, each with one replica, so 36 shards in total on 4 data nodes. Looking at the exceptions and the monitoring around the time the cluster turned yellow, there was a program doing bulk writes, so I suspect the shards needed to be rebalanced as the index grew. But does this really have anything to do with index.allocation.max_retries? Here is the result of my explain execution:

{
  "index": "user",
  "shard": 0,
  "primary": false,
  "current_state": "started",
  "current_node": { - 
    "id": "n_1VNjp0R7mecVYHMYTiDQ",
    "name": "es-data-8",
    "transport_address": "xxxx",
    "attributes": { - 
      "xpack.installed": "true"
    },
    "weight_ranking": 1
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "yes",
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance",
  "node_allocation_decisions": [ - 
    { - 
      "node_id": "acOfOoZ_TCKl5dUci4HG5A",
      "node_name": "es-data-9",
      "transport_address": "xxxx",
      "node_attributes": { - 
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [ - 
        { - 
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[user_crowd_relation_2020.12.11][0], node[acOfOoZ_TCKl5dUci4HG5A], [P], s[STARTED], a[id=1QGQiqg3TLu1UMDnEmvtFw]]"
        }
      ]
    },
    {
      "node_id": "T0h6slW5QOeCg4gn4P_Iiw",
      "node_name": "es-data-5",
      "transport_address": "xxxx",
      "node_attributes": { - 
        "xpack.installed": "true"
      },
      "node_decision": "worse_balance",
      "weight_ranking": 1
    },
    {
      "node_id": "jBofE-wzTZiQZcfVa70P6w",
      "node_name": "es-data-7",
      "transport_address": "xxxx",
      "node_attributes": { - 
        "xpack.installed": "true"
      },
      "node_decision": "worse_balance",
      "weight_ranking": 1
    }
  ]
}
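
For reference, output of this shape comes from a request roughly like the one below (the index name is taken from the "index" field above; the deciders mention user_crowd_relation_2020.12.11, so substitute whichever name applies, and localhost:9200 is an assumption):

curl -s -X GET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "index": "user",
    "shard": 0,
    "primary": false
  }'

If the shard were still unassigned, the same request should also return an unassigned_info block with the number of failed allocation attempts.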
