Recovery failed

Hi team,

A few days ago, I encountered a circuit_breaking_exception that caused a node in my cluster to fail. I increased the node's memory and performed a rolling restart of the cluster. After that, everything appeared to work normally again.

However, when I checked the shard status, I found a large number of unassigned replica shards. The "active_shards_percent_as_number" of my cluster has only reached 84% (it was ~63% before the restart).
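
For context, this is roughly how I've been checking the shard state (the filter_path and column choices are just what I happened to use):

GET _cluster/health?filter_path=status,active_shards_percent_as_number,unassigned_shards
GET _cat/shards?v=true&h=index,shard,prirep,state,unassigned.reason&s=state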

I ran the command GET _cluster/allocation/explain and got the response:

{
  "index": "my-index-0011",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2024-07-02T03:06:16.503Z",
    "failed_allocation_attempts": 5,
    "details": """failed shard on node [eZXSVD4OR9Cxo-O8uBQCIQ]: failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException: [my-index-0011][0]: Recovery failed from {instance-0000000016}{e7qX1R92TUu_eayMWir9Ew}{6xsESxLHQi6bnHZcSjFMjA}{instance-0000000016}{10.44.1.29}{10.44.1.29:19560}{rw}{logical_availability_zone=zone-1, server_name=instance-0000000016.b93ad0bfba564e99864b092da16070c2, availability_zone=us-west1-c, xpack.installed=true, data=warm, instance_configuration=gcp.es.datawarm.n2.68x10x190, region=unknown-region} into {instance-0000000031}{eZXSVD4OR9Cxo-O8uBQCIQ}{zJhmDQ5sQZqn-mZvuGmr2w}{instance-0000000031}{10.44.0.5}{10.44.0.5:19787}{rw}{availability_zone=us-west1-c, logical_availability_zone=zone-0, xpack.installed=true, data=warm, server_name=instance-0000000031.b93ad0bfba564e99864b092da16070c2, instance_configuration=gcp.es.datawarm.n2.68x10x190, region=unknown-region}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:810)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1350)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1350)
	at org.elasticsearch.transport.InboundHandler.doHandleException(InboundHandler.java:406)
	at org.elasticsearch.transport.InboundHandler$3.doRun(InboundHandler.java:398)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:833)
Caused by: org.elasticsearch.transport.RemoteTransportException: [instance-0000000016][172.17.0.16:19560][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [2117090596/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2117088784/1.9gb], new bytes reserved: [1812/1.7kb], usages [model_inference=0/0b, inflight_requests=970192/947.4kb, request=0/0b, fielddata=8142759/7.7mb, eql_sequence=0/0b]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:414)
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:109)
	at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:215)
	at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:119)
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:147)
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:121)
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:86)
	at org.elasticsearch.transport.netty4.Netty4MessageInboundHandler.channelRead(Netty4MessageInboundHandler.java:63)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1372)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1235)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1284)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:833)
""",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "yes",
  "allocate_explanation": "Elasticsearch can allocate the shard.",
  "target_node": {
    "id": "eZXSVD4OR9Cxo-O8uBQCIQ",
    "name": "instance-0000000031",
    "transport_address": "10.44.0.5:19787",
    "attributes": {
      "logical_availability_zone": "zone-0",
      "availability_zone": "us-west1-c",
      "server_name": "instance-0000000031.b93ad0bfba564e99864b092da16070c2",
      "xpack.installed": "true",
      "data": "warm",
      "instance_configuration": "gcp.es.datawarm.n2.68x10x190",
      "region": "unknown-region"
    }
  },
  "node_allocation_decisions": [
    {
      "node_id": "eZXSVD4OR9Cxo-O8uBQCIQ",
      "node_name": "instance-0000000031",
      "transport_address": "10.44.0.5:19787",
      "node_attributes": {
        "logical_availability_zone": "zone-0",
        "availability_zone": "us-west1-c",
        "server_name": "instance-0000000031.b93ad0bfba564e99864b092da16070c2",
        "xpack.installed": "true",
        "data": "warm",
        "instance_configuration": "gcp.es.datawarm.n2.68x10x190",
        "region": "unknown-region"
      },
      "node_decision": "yes",
      "weight_ranking": 4
    },
    {
      "node_id": "3KpOIzWjS6S-RMuCXu9ceA",
      "node_name": "instance-0000000033",
      "transport_address": "10.44.0.82:19147",
      "node_attributes": {
        "logical_availability_zone": "zone-2",
        "availability_zone": "us-west1-a",
        "server_name": "instance-0000000033.b93ad0bfba564e99864b092da16070c2",
        "xpack.installed": "true",
        "data": "warm",
        "instance_configuration": "gcp.es.datawarm.n2.68x10x190",
        "region": "unknown-region"
      },
      "node_decision": "yes",
      "weight_ranking": 6
    },
    {
      "node_id": "tNXrK1z6Tn--usSwCa-Psg",
      "node_name": "instance-0000000027",
      "transport_address": "10.44.1.237:19575",
      "node_attributes": {
        "logical_availability_zone": "zone-0",
        "server_name": "instance-0000000027.b93ad0bfba564e99864b092da16070c2",
        "availability_zone": "us-west1-c",
        "xpack.installed": "true",
        "data": "hot",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_warm,data_hot] and node does not meet the required [data_warm] tier"
        }
      ]
    },
    {
      "node_id": "NvgV7V0mQ86N0S8qw1FOqw",
      "node_name": "instance-0000000029",
      "transport_address": "10.44.1.236:19878",
      "node_attributes": {
        "logical_availability_zone": "zone-2",
        "availability_zone": "us-west1-b",
        "server_name": "instance-0000000029.b93ad0bfba564e99864b092da16070c2",
        "xpack.installed": "true",
        "data": "hot",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region"
      },
      "node_decision": "no",
      "weight_ranking": 2,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_warm,data_hot] and node does not meet the required [data_warm] tier"
        }
      ]
    },
    {
      "node_id": "X5yI0CppTeS-DE86eY06iQ",
      "node_name": "instance-0000000030",
      "transport_address": "10.44.0.137:19051",
      "node_attributes": {
        "logical_availability_zone": "zone-1",
        "availability_zone": "us-west1-a",
        "server_name": "instance-0000000030.b93ad0bfba564e99864b092da16070c2",
        "xpack.installed": "true",
        "data": "hot",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region"
      },
      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "awareness",
          "decision": "NO",
          "explanation": "there are [3] copies of this shard and [3] values for attribute [logical_availability_zone] ([zone-0, zone-1, zone-2] from nodes in the cluster and no forced awareness) so there may be at most [1] copies of this shard allocated to nodes with each value, but (including this copy) there would be [2] copies allocated to nodes with [node.attr.logical_availability_zone: zone-1]"
        },
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_warm,data_hot] and node does not meet the required [data_warm] tier"
        }
      ]
    },
    {
      "node_id": "A34RAzsLToqxza6F1s9hsg",
      "node_name": "instance-0000000032",
      "transport_address": "10.44.0.75:19204",
      "node_attributes": {
        "logical_availability_zone": "zone-1",
        "server_name": "instance-0000000032.b93ad0bfba564e99864b092da16070c2",
        "availability_zone": "us-west1-b",
        "xpack.installed": "true",
        "data": "warm",
        "instance_configuration": "gcp.es.datawarm.n2.68x10x190",
        "region": "unknown-region"
      },
      "node_decision": "no",
      "weight_ranking": 5,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[my-index-0011][0], node[A34RAzsLToqxza6F1s9hsg], [P], s[STARTED], a[id=1sdmGOCnQq6wdIEWNwosyw]]"
        },
        {
          "decider": "awareness",
          "decision": "NO",
          "explanation": "there are [3] copies of this shard and [3] values for attribute [logical_availability_zone] ([zone-0, zone-1, zone-2] from nodes in the cluster and no forced awareness) so there may be at most [1] copies of this shard allocated to nodes with each value, but (including this copy) there would be [2] copies allocated to nodes with [node.attr.logical_availability_zone: zone-1]"
        }
      ]
    }
  ]
}
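
Since "failed_allocation_attempts" is already at 5 (which I believe is the default index.allocation.max_retries) and "can_allocate" is "yes", I'm wondering whether I just need to retry the failed allocations once the memory pressure is gone, with something like:

POST _cluster/reroute?retry_failed=true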

The cluster is running Elasticsearch version 8.3.3.
How can I resolve my issue? What is the root cause?

Regards,
Phu

Hi @Phu_Van_Nguyen,

Welcome! Can you share the memory settings that you changed to? It looks like you are still encountering a circuit breaker exception in the recovery stage due to the data being too large:

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [2117090596/1.9gb], which is larger than the limit of [2040109465/1.8gb]
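
If it's useful, you can check the heap each node is actually running with, and the configured parent breaker limit (which defaults to 95% of the JVM heap when real-memory checking is enabled), with something like:

GET _cat/nodes?v=true&h=name,node.role,heap.max,heap.percent
GET _cluster/settings?include_defaults=true&filter_path=defaults.indices.breaker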

@carly.richmond I've set the node to 8GB of RAM (4GB heap). Previously it was 4GB of RAM (2GB heap).