Hi, I'm facing ALLOCATION_FAILED and I want to know the reason. My cluster consists of 51 nodes on 3 hosts under Docker Swarm, and I have also configured data tiers.
Can you explain how I can recover the unassigned shards through Dev Tools? It seems that only replica shards are affected.
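For reference, this is what I ran in Kibana Dev Tools to get the explanation below. The second request is only my understanding of the note in the response about targeting a specific shard, with the index/shard/primary values copied from the output. My guess is that I would eventually retry the failed replicas with POST _cluster/reroute?retry_failed=true (the shard already hit 5 failed allocation attempts), but I'd like to understand the CircuitBreakingException in the details first.

# no body: explains a randomly chosen unassigned shard
GET _cluster/allocation/explain

# targeting the specific replica shard, as suggested by the note in the response
GET _cluster/allocation/explain
{
  "index": "logstash-ebm-sgu-srv40990kab-b12-2022.06.18",
  "shard": 0,
  "primary": false
}

Here is the response: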
{
"note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
"index" : "logstash-ebm-sgu-srv40990kab-b12-2022.06.18",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2022-06-18T15:16:33.007Z",
"failed_allocation_attempts" : 5,
"details" : """failed shard on node [eRk6tLc3RFG0zYOsnFPFUw]: failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-ebm-sgu-srv40990kab-b12-2022.06.18][0]: Recovery failed from {es_data_ssd_5_1}{XMyLrsVFSJiKSs-yUhZkgA}{-gwnp-iZTYKaMlpgb6zBFQ}{10.0.9.102}{10.0.9.102:9300}{hs}{rack_id=rack_one, xpack.installed=true} into {es_data_ssd_4_3}{eRk6tLc3RFG0zYOsnFPFUw}{4nWdvQE6QnCP3XEAcVj--Q}{10.0.9.146}{10.0.9.146:9300}{hs}{xpack.installed=true, rack_id=rack_three}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:816)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349)
at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:397)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.lang.Thread.run(Thread.java:833)
Caused by: org.elasticsearch.transport.RemoteTransportException: [es_data_ssd_5_1][10.0.9.102:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [8513229092/7.9gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8513228144/7.9gb], new bytes reserved: [948/948b], usages [fielddata=2149465577/2gb, request=0/0b, inflight_requests=1298/1.2kb, model_inference=0/0b, eql_sequence=0/0b]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:440)
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:108)
at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:215)
at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:119)
at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:147)
at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:121)
at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:86)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:833)
""",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "awaiting_info",
"allocate_explanation" : "cannot allocate because information about existing shard data is still being retrieved from some of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "1YxZtfHETGqHHxfbVI3lHQ",
"node_name" : "es_data_ssd_1_2",
"transport_address" : "10.0.9.117:9300",
"node_attributes" : {
"rack_id" : "rack_two",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "8RZ_g_qIQjSW7g_jgA0LRA",
"node_name" : "es_data_ssd_3_3",
"transport_address" : "10.0.9.143:9300",
"node_attributes" : {
"rack_id" : "rack_three",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "NPcmoyidSr2r4GiO07uimw",
"node_name" : "es_data_ssd_4_2",
"transport_address" : "10.0.9.128:9300",
"node_attributes" : {
"rack_id" : "rack_two",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "WqnNe05MTuuYzY8uwI_P7Q",
"node_name" : "es_data_ssd_5_3",
"transport_address" : "10.0.9.132:9300",
"node_attributes" : {
"rack_id" : "rack_three",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "YEzEpdGpT7iECEXqVTVhDQ",
"node_name" : "es_data_ssd_2_3",
"transport_address" : "10.0.9.136:9300",
"node_attributes" : {
"rack_id" : "rack_three",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "azCWBxpOTnC9Tj5vZSw1yw",
"node_name" : "es_data_ssd_1_3",
"transport_address" : "10.0.9.140:9300",
"node_attributes" : {
"rack_id" : "rack_three",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "eRk6tLc3RFG0zYOsnFPFUw",
"node_name" : "es_data_ssd_4_3",
"transport_address" : "10.0.9.146:9300",
"node_attributes" : {
"rack_id" : "rack_three",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "fjWTi2hUQxO3tEzT5zwGog",
"node_name" : "es_data_ssd_3_2",
"transport_address" : "10.0.9.115:9300",
"node_attributes" : {
"rack_id" : "rack_two",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "ifoM86KLRxarQqBVnAoa4A",
"node_name" : "es_data_ssd_5_2",
"transport_address" : "10.0.9.119:9300",
"node_attributes" : {
"rack_id" : "rack_two",
"xpack.installed" : "true"
},
"node_decision" : "yes"
},
{
"node_id" : "wgCiK7OqTbG6XUPTtv-_gg",
"node_name" : "es_data_ssd_2_2",
"transport_address" : "10.0.9.118:9300",
"node_attributes" : {
"rack_id" : "rack_two",
"xpack.installed" : "true"
},