Questions about index.allocation.max_retries?

Yesterday our cluster turned yellow because a shard of one index failed to allocate, and a colleague found the following exception in the logs:

[2021-03-16T17:34:24,934][WARN ][o.e.a.b.TransportShardBulkAction] [es-data-5] [[user_crowd_relation_2020.12.11][12]] failed to perform indices:data/write/bulk[s] on replica [user_crowd_relation_2020.12.11][12], node[jBofE-wzTZiQZcfVa70P6w], [R], s[STARTED], a[id=Y-XPYf_gSeO5Jae30tVD8Q]
org.elasticsearch.transport.RemoteTransportException: [es-data-7][172.18.141.41:18008][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [20411577964/19gb], which is larger than the limit of [20401094656/19gb], real usage: [20411225480/19gb], new bytes reserved: [352484/344.2kb], usages [request=0/0b, fielddata=10613/10.3kb, in_flight_requests=352484/344.2kb, accounting=825619336/787.3mb]
    at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:170) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:118) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
    at java.lang.Thread.run(Thread.java:830) [?:?]
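
From the exception it looks like the parent circuit breaker on es-data-7 tripped while receiving a replica bulk request: the real heap usage was already above the configured limit. For reference, breaker usage and limits can be checked per node with something like the following (a sketch assuming the cluster is reachable on localhost:9200; adjust host, port and authentication for your setup):

# Per-node circuit breaker usage and limits
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'

# The parent limit that was exceeded above is controlled by indices.breaker.total.limit
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep -i breaker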

He suspected that shard allocation had hit the maximum number of retries because of this resource problem, so he changed index.allocation.max_retries from 60 to 109. According to him (a colleague had tested this before), the retry counter keeps accumulating over the life of the index and is not reset after a successful allocation, so the limit has to be raised from time to time. But I don't think that should be the case, and I couldn't find anything about this in the documentation. So my questions are: does this counter really keep accumulating, and what was the real underlying reason for the cluster turning yellow? Since everything was fine after index.allocation.max_retries was raised, I didn't have time to run an allocation explain at that point.
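
For completeness, the setting can be inspected and changed roughly like this (a sketch, assuming the cluster is reachable on localhost:9200; 109 is simply the value my colleague chose):

# Effective value of index.allocation.max_retries for the index (the built-in default is 5)
curl -s 'http://localhost:9200/user_crowd_relation_2020.12.11/_settings?include_defaults=true&flat_settings=true&pretty' | grep max_retries

# Raise the limit on the index
curl -s -X PUT 'http://localhost:9200/user_crowd_relation_2020.12.11/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.allocation.max_retries": 109}'

# Alternatively, retry allocations that already hit the retry limit without changing the setting
curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'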

The Elasticsearch version is 7.4.2.

The index in question holds about 262 GB of data in 18 primary shards, each with one replica, so 36 shards in total on 4 data nodes. Looking at the exceptions and the monitoring around the time the cluster turned yellow, there was a program doing bulk writes, so I suspect the shards needed to be rebalanced as the index grew. But does this really have anything to do with index.allocation.max_retries? Here is the result of my explain execution:

{
  "index": "user",
  "shard": 0,
  "primary": false,
  "current_state": "started",
  "current_node": { - 
    "id": "n_1VNjp0R7mecVYHMYTiDQ",
    "name": "es-data-8",
    "transport_address": "xxxx",
    "attributes": { - 
      "xpack.installed": "true"
    },
    "weight_ranking": 1
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "yes",
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance",
  "node_allocation_decisions": [ - 
    { - 
      "node_id": "acOfOoZ_TCKl5dUci4HG5A",
      "node_name": "es-data-9",
      "transport_address": "xxxx",
      "node_attributes": { - 
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [ - 
        { - 
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[user_crowd_relation_2020.12.11][0], node[acOfOoZ_TCKl5dUci4HG5A], [P], s[STARTED], a[id=1QGQiqg3TLu1UMDnEmvtFw]]"
        }
      ]
    },
    {
      "node_id": "T0h6slW5QOeCg4gn4P_Iiw",
      "node_name": "es-data-5",
      "transport_address": "xxxx",
      "node_attributes": { - 
        "xpack.installed": "true"
      },
      "node_decision": "worse_balance",
      "weight_ranking": 1
    },
    {
      "node_id": "jBofE-wzTZiQZcfVa70P6w",
      "node_name": "es-data-7",
      "transport_address": "xxxx",
      "node_attributes": { - 
        "xpack.installed": "true"
      },
      "node_decision": "worse_balance",
      "weight_ranking": 1
    }
  ]
}
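
For reference, output of this shape comes from a request roughly like the one below (the index name is taken from the "index" field above; the deciders mention user_crowd_relation_2020.12.11, so substitute whichever name applies, and localhost:9200 is an assumption):

curl -s -X GET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "index": "user",
    "shard": 0,
    "primary": false
  }'

If the shard were still unassigned, the same request should also return an unassigned_info block with the number of failed allocation attempts.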
