Index shard is lost, but the cluster status is green

Elasticsearch version (bin/elasticsearch --version): Version: 7.3.1

Plugins installed:

JVM version (java -version): JDK 8

OS version: Linux

Description of the problem including expected versus actual behavior:
Under heavy query load, the Elasticsearch node's heap usage keeps climbing and each GC cycle reclaims very little memory.
Just as a full GC is about to be triggered, a shard migration for the index starts.
Because the parent circuit breaker has already tripped, the shard migration (peer recovery) fails.
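For context, the parent breaker limit and its live usage can be checked per node. A minimal sketch, assuming the cluster is reachable at localhost:9200 (adjust host, port, and credentials for your setup):

# Per-node circuit breaker stats: limit, estimated usage, and trip count
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'

# Effective parent breaker limit (indices.breaker.total.limit, 95% of heap by default on 7.x)
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep indices.breaker.total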

[2021-10-27T05:30:09,194][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990669][317507] duration [833ms], collections [1]/[1s], total [833ms]/[8.3h], memory [41gb]->[40.9gb]/[73.9gb], all_pools {[young] [576.3mb]->[5.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [40.4gb]->[40.8gb]/[72.9gb]}
[2021-10-27T05:30:09,194][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990669] overhead, spent [833ms] collecting in the last [1s]
[2021-10-27T05:30:24,632][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990683][317527] duration [765ms], collections [1]/[1.1s], total [765ms]/[8.3h], memory [45.4gb]->[45.9gb]/[73.9gb], all_pools {[young] [25mb]->[538.8kb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [45.3gb]->[45.8gb]/[72.9gb]}
[2021-10-27T05:30:28,030][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990686] overhead, spent [981ms] collecting in the last [1.3s]
[2021-10-27T05:30:46,970][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990702] overhead, spent [1s] collecting in the last [1.4s]
[2021-10-27T05:30:48,050][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990703][317558] duration [797ms], collections [1]/[1s], total [797ms]/[8.3h], memory [54.6gb]->[54.9gb]/[73.9gb], all_pools {[young] [207.4mb]->[860.6kb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [54.3gb]->[54.8gb]/[72.9gb]}
[2021-10-27T05:30:48,050][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990703] overhead, spent [797ms] collecting in the last [1s]
[2021-10-27T05:30:51,257][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990706][317561] duration [789ms], collections [1]/[1s], total [789ms]/[8.3h], memory [55.9gb]->[56gb]/[73.9gb], all_pools {[young] [332.9mb]->[1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [55.4gb]->[55.9gb]/[72.9gb]}
[2021-10-27T05:30:51,257][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990706] overhead, spent [789ms] collecting in the last [1s]
[2021-10-27T05:31:01,154][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990715] overhead, spent [989ms] collecting in the last [1.3s]
[2021-10-27T05:31:03,209][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990717][317576] duration [728ms], collections [1]/[1s], total [728ms]/[8.3h], memory [60.2gb]->[60.3gb]/[73.9gb], all_pools {[young] [260.4mb]->[1.1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [59.8gb]->[60.2gb]/[72.9gb]}
[2021-10-27T05:31:17,294][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990729] overhead, spent [1s] collecting in the last [1.4s]
[2021-10-27T05:31:43,554][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990753][317628] duration [711ms], collections [1]/[1s], total [711ms]/[8.3h], memory [70.4gb]->[70.5gb]/[73.9gb], all_pools {[young] [805.4mb]->[606mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [69.5gb]->[69.8gb]/[72.9gb]}
[2021-10-27T05:31:43,554][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990753] overhead, spent [711ms] collecting in the last [1s]
[2021-10-27T05:31:44,560][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990754][317629] duration [702ms], collections [1]/[1s], total [702ms]/[8.3h], memory [70.5gb]->[70.5gb]/[73.9gb], all_pools {[young] [606mb]->[417.3mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [69.8gb]->[70gb]/[72.9gb]}
[2021-10-27T05:31:48,350][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [1002-fsm_project_info-fsm_project_info_es-20201202145037][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [1002-fsm_project_info-fsm_project_info_es-20201202145037][1]: Recovery failed from {node-1}{JAkuqfrnQhWAVZMYQ7r_dw}{WbkQBrO2RumkTipaWtufzw}{xxxxxx}{xxxxxx:9311}{dim} into {node-0}{NbnZ28_aTbOrYieZNVA3tA}{5YmF6geTQ4-HLChonWGBOQ}{xxxxxxx}{xxxxxx:9311}{dim}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.1.jar:7.3.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_272]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_272]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_272]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-1][xxxxxxx:9311][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$23(RecoverySourceHandler.java:470) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.3.1.jar:7.3.1]
	... 7 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-0][xxxxxxx:9311][internal:index/shard/recovery/prepare_translog]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [76378064690/71.1gb], which is larger than the limit of [75418179993/70.2gb], real usage: [76378064376/71.1gb], new bytes reserved: [314/314b], usages [request=0/0b, fielddata=331758/323.9kb, in_flight_requests=2198/2.1kb, accounting=676895467/645.5mb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1541) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1290) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1337) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	... 1 more
[2021-10-27T05:31:49,019][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [1002-fsm_project_info-fsm_project_info_es-20201202145037][1] marking and sending shard failed due to [failed recovery]

My guess is that the shard recovery check failed because the circuit breaker tripped; after more than the default of 5 allocation failures, the shard is marked as unavailable. At that point, however, the cluster status is still green, which I think is unreasonable.
Another accompanying symptom: in the result returned by the _cat/shards API, the shard's state is STARTED, but its document count and store size are empty.

1002-fsm_project_info-fsm_project_info_es-20201202145037 1 p STARTED                 xxxxxx  node-0
1002-fsm_project_info-fsm_project_info_es-20211102548881 0 p STARTED 5950102 21.4g   xxxxxx  node-0

1002-fsm_project_info-fsm_project_info_es-20201202145037 is abnormal.
1002-fsm_project_info-fsm_project_info_es-20211102548881 is normal.
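To cross-check what the master reports for those shard copies, and to retry the allocation once the heap pressure subsides, something along these lines should work (again assuming a localhost:9200 endpoint; the index and shard values below match the abnormal shard above):

# Shard-level health; a failed or unassigned copy should normally keep the cluster from reporting green
curl -s 'http://localhost:9200/_cluster/health?level=shards&pretty'

# Ask the allocator why this specific primary is where it is (or why it is not assigned)
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '{"index":"1002-fsm_project_info-fsm_project_info_es-20201202145037","shard":1,"primary":true}'

# Retry allocations that have exhausted index.allocation.max_retries (default 5)
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'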

7.3 is EOL and no longer supported; please upgrade as a first step.

Is this a known problem?
I also encountered it in version 7.10.2.

What is the output from the _cluster/stats?pretty&human API?
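For reference, something like this should fetch it (assuming the node listens on localhost:9200):

curl -s 'http://localhost:9200/_cluster/stats?pretty&human'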

The cluster status is green.
However, when querying the information for all shards, some of them are abnormal, for example:

1002-fsm_project_info-fsm_project_info_es-20201202145037 1 p STARTED                 xxxxxx  node-0
1002-fsm_project_info-fsm_project_info_es-20211102548881 0 p STARTED 5950102 21.4g   xxxxxx  node-0

In the cluster log from that time, there are entries showing that shard recovery for this index failed because the circuit breaker tripped.
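If it helps, the recovery attempts for that index can also be listed with the cat recovery API; a rough sketch against the same hypothetical localhost:9200 endpoint:

# Recovery attempts for the affected index, including stage and source/target nodes
curl -s 'http://localhost:9200/_cat/recovery/1002-fsm_project_info-fsm_project_info_es-20201202145037?v&h=index,shard,time,type,stage,source_node,target_node'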

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.