Hi team,
A few days ago, I encountered a circuit_breaking_exception that caused a node in my cluster to fail. I increased the node's memory and did a rolling restart of the cluster, and after that everything seemed to work normally again.
However, when I checked the shard status, I found a large number of unassigned replica shards. My cluster's "active_shards_percent_as_number" is stuck at 84%; before the restart it was ~63%.
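For reference, I'm reading that number from the cluster health API:

GET _cluster/health

which also reports "unassigned_shards" alongside "active_shards_percent_as_number".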
I ran GET _cluster/allocation/explain and got this response:
{
"index": "my-index-0011",
"shard": 0,
"primary": false,
"current_state": "unassigned",
"unassigned_info": {
"reason": "ALLOCATION_FAILED",
"at": "2024-07-02T03:06:16.503Z",
"failed_allocation_attempts": 5,
"details": """failed shard on node [eZXSVD4OR9Cxo-O8uBQCIQ]: failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException: [my-index-0011][0]: Recovery failed from {instance-0000000016}{e7qX1R92TUu_eayMWir9Ew}{6xsESxLHQi6bnHZcSjFMjA}{instance-0000000016}{10.44.1.29}{10.44.1.29:19560}{rw}{logical_availability_zone=zone-1, server_name=instance-0000000016.b93ad0bfba564e99864b092da16070c2, availability_zone=us-west1-c, xpack.installed=true, data=warm, instance_configuration=gcp.es.datawarm.n2.68x10x190, region=unknown-region} into {instance-0000000031}{eZXSVD4OR9Cxo-O8uBQCIQ}{zJhmDQ5sQZqn-mZvuGmr2w}{instance-0000000031}{10.44.0.5}{10.44.0.5:19787}{rw}{availability_zone=us-west1-c, logical_availability_zone=zone-0, xpack.installed=true, data=warm, server_name=instance-0000000031.b93ad0bfba564e99864b092da16070c2, instance_configuration=gcp.es.datawarm.n2.68x10x190, region=unknown-region}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:810)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1350)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1350)
at org.elasticsearch.transport.InboundHandler.doHandleException(InboundHandler.java:406)
at org.elasticsearch.transport.InboundHandler$3.doRun(InboundHandler.java:398)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.lang.Thread.run(Thread.java:833)
Caused by: org.elasticsearch.transport.RemoteTransportException: [instance-0000000016][172.17.0.16:19560][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [2117090596/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2117088784/1.9gb], new bytes reserved: [1812/1.7kb], usages [model_inference=0/0b, inflight_requests=970192/947.4kb, request=0/0b, fielddata=8142759/7.7mb, eql_sequence=0/0b]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:414)
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:109)
at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:215)
at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:119)
at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:147)
at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:121)
at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:86)
at org.elasticsearch.transport.netty4.Netty4MessageInboundHandler.channelRead(Netty4MessageInboundHandler.java:63)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1372)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1235)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1284)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:833)
""",
"last_allocation_status": "no_attempt"
},
"can_allocate": "yes",
"allocate_explanation": "Elasticsearch can allocate the shard.",
"target_node": {
"id": "eZXSVD4OR9Cxo-O8uBQCIQ",
"name": "instance-0000000031",
"transport_address": "10.44.0.5:19787",
"attributes": {
"logical_availability_zone": "zone-0",
"availability_zone": "us-west1-c",
"server_name": "instance-0000000031.b93ad0bfba564e99864b092da16070c2",
"xpack.installed": "true",
"data": "warm",
"instance_configuration": "gcp.es.datawarm.n2.68x10x190",
"region": "unknown-region"
}
},
"node_allocation_decisions": [
{
"node_id": "eZXSVD4OR9Cxo-O8uBQCIQ",
"node_name": "instance-0000000031",
"transport_address": "10.44.0.5:19787",
"node_attributes": {
"logical_availability_zone": "zone-0",
"availability_zone": "us-west1-c",
"server_name": "instance-0000000031.b93ad0bfba564e99864b092da16070c2",
"xpack.installed": "true",
"data": "warm",
"instance_configuration": "gcp.es.datawarm.n2.68x10x190",
"region": "unknown-region"
},
"node_decision": "yes",
"weight_ranking": 4
},
{
"node_id": "3KpOIzWjS6S-RMuCXu9ceA",
"node_name": "instance-0000000033",
"transport_address": "10.44.0.82:19147",
"node_attributes": {
"logical_availability_zone": "zone-2",
"availability_zone": "us-west1-a",
"server_name": "instance-0000000033.b93ad0bfba564e99864b092da16070c2",
"xpack.installed": "true",
"data": "warm",
"instance_configuration": "gcp.es.datawarm.n2.68x10x190",
"region": "unknown-region"
},
"node_decision": "yes",
"weight_ranking": 6
},
{
"node_id": "tNXrK1z6Tn--usSwCa-Psg",
"node_name": "instance-0000000027",
"transport_address": "10.44.1.237:19575",
"node_attributes": {
"logical_availability_zone": "zone-0",
"server_name": "instance-0000000027.b93ad0bfba564e99864b092da16070c2",
"availability_zone": "us-west1-c",
"xpack.installed": "true",
"data": "hot",
"instance_configuration": "gcp.es.datahot.n2.68x32x45",
"region": "unknown-region"
},
"node_decision": "no",
"weight_ranking": 1,
"deciders": [
{
"decider": "data_tier",
"decision": "NO",
"explanation": "index has a preference for tiers [data_warm,data_hot] and node does not meet the required [data_warm] tier"
}
]
},
{
"node_id": "NvgV7V0mQ86N0S8qw1FOqw",
"node_name": "instance-0000000029",
"transport_address": "10.44.1.236:19878",
"node_attributes": {
"logical_availability_zone": "zone-2",
"availability_zone": "us-west1-b",
"server_name": "instance-0000000029.b93ad0bfba564e99864b092da16070c2",
"xpack.installed": "true",
"data": "hot",
"instance_configuration": "gcp.es.datahot.n2.68x32x45",
"region": "unknown-region"
},
"node_decision": "no",
"weight_ranking": 2,
"deciders": [
{
"decider": "data_tier",
"decision": "NO",
"explanation": "index has a preference for tiers [data_warm,data_hot] and node does not meet the required [data_warm] tier"
}
]
},
{
"node_id": "X5yI0CppTeS-DE86eY06iQ",
"node_name": "instance-0000000030",
"transport_address": "10.44.0.137:19051",
"node_attributes": {
"logical_availability_zone": "zone-1",
"availability_zone": "us-west1-a",
"server_name": "instance-0000000030.b93ad0bfba564e99864b092da16070c2",
"xpack.installed": "true",
"data": "hot",
"instance_configuration": "gcp.es.datahot.n2.68x32x45",
"region": "unknown-region"
},
"node_decision": "no",
"weight_ranking": 3,
"deciders": [
{
"decider": "awareness",
"decision": "NO",
"explanation": "there are [3] copies of this shard and [3] values for attribute [logical_availability_zone] ([zone-0, zone-1, zone-2] from nodes in the cluster and no forced awareness) so there may be at most [1] copies of this shard allocated to nodes with each value, but (including this copy) there would be [2] copies allocated to nodes with [node.attr.logical_availability_zone: zone-1]"
},
{
"decider": "data_tier",
"decision": "NO",
"explanation": "index has a preference for tiers [data_warm,data_hot] and node does not meet the required [data_warm] tier"
}
]
},
{
"node_id": "A34RAzsLToqxza6F1s9hsg",
"node_name": "instance-0000000032",
"transport_address": "10.44.0.75:19204",
"node_attributes": {
"logical_availability_zone": "zone-1",
"server_name": "instance-0000000032.b93ad0bfba564e99864b092da16070c2",
"availability_zone": "us-west1-b",
"xpack.installed": "true",
"data": "warm",
"instance_configuration": "gcp.es.datawarm.n2.68x10x190",
"region": "unknown-region"
},
"node_decision": "no",
"weight_ranking": 5,
"deciders": [
{
"decider": "same_shard",
"decision": "NO",
"explanation": "a copy of this shard is already allocated to this node [[my-index-0011][0], node[A34RAzsLToqxza6F1s9hsg], [P], s[STARTED], a[id=1sdmGOCnQq6wdIEWNwosyw]]"
},
{
"decider": "awareness",
"decision": "NO",
"explanation": "there are [3] copies of this shard and [3] values for attribute [logical_availability_zone] ([zone-0, zone-1, zone-2] from nodes in the cluster and no forced awareness) so there may be at most [1] copies of this shard allocated to nodes with each value, but (including this copy) there would be [2] copies allocated to nodes with [node.attr.logical_availability_zone: zone-1]"
}
]
}
]
}
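If it helps, I believe I can also point the explain API at this particular replica with a request body, e.g.:

GET _cluster/allocation/explain
{
  "index": "my-index-0011",
  "shard": 0,
  "primary": false
}

and I can share that output as well if needed.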
The cluster is running Elasticsearch version 8.3.3.
How can I resolve my issue? What is the root cause?
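Since the output above shows "failed_allocation_attempts": 5, I'm wondering whether simply retrying the failed allocations would be enough, e.g. something like:

POST _cluster/reroute?retry_failed=true

but I'd like to understand the root cause first, in case the recoveries just trip the circuit breaker again.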
Regards,
Phu