Index shard is lost, but the cluster status is green

Elasticsearch version (bin/elasticsearch --version): Version: 7.3.1

Plugins installed:

JVM version (java -version): JDK 8

OS version: Linux

Description of the problem including expected versus actual behavior:
Under heavy query load, the Elasticsearch node's heap usage keeps climbing and each GC cycle reclaims very little memory.
Just as a full GC is about to be triggered, a shard migration for the index starts.
Because the parent circuit breaker has already tripped, the shard migration (peer recovery) fails.
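For context, the parent breaker limit and its live usage can be checked per node. A minimal sketch, assuming the cluster is reachable at localhost:9200 (adjust host, port, and credentials for your setup):

# Per-node circuit breaker stats: limit, estimated usage, and trip count
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'

# Effective parent breaker limit (indices.breaker.total.limit, 95% of heap by default on 7.x)
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep indices.breaker.total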

[2021-10-27T05:30:09,194][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990669][317507] duration [833ms], collections [1]/[1s], total [833ms]/[8.3h], memory [41gb]->[40.9gb]/[73.9gb], all_pools {[young] [576.3mb]->[5.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [40.4gb]->[40.8gb]/[72.9gb]}
[2021-10-27T05:30:09,194][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990669] overhead, spent [833ms] collecting in the last [1s]
[2021-10-27T05:30:24,632][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990683][317527] duration [765ms], collections [1]/[1.1s], total [765ms]/[8.3h], memory [45.4gb]->[45.9gb]/[73.9gb], all_pools {[young] [25mb]->[538.8kb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [45.3gb]->[45.8gb]/[72.9gb]}
[2021-10-27T05:30:28,030][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990686] overhead, spent [981ms] collecting in the last [1.3s]
[2021-10-27T05:30:46,970][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990702] overhead, spent [1s] collecting in the last [1.4s]
[2021-10-27T05:30:48,050][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990703][317558] duration [797ms], collections [1]/[1s], total [797ms]/[8.3h], memory [54.6gb]->[54.9gb]/[73.9gb], all_pools {[young] [207.4mb]->[860.6kb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [54.3gb]->[54.8gb]/[72.9gb]}
[2021-10-27T05:30:48,050][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990703] overhead, spent [797ms] collecting in the last [1s]
[2021-10-27T05:30:51,257][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990706][317561] duration [789ms], collections [1]/[1s], total [789ms]/[8.3h], memory [55.9gb]->[56gb]/[73.9gb], all_pools {[young] [332.9mb]->[1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [55.4gb]->[55.9gb]/[72.9gb]}
[2021-10-27T05:30:51,257][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990706] overhead, spent [789ms] collecting in the last [1s]
[2021-10-27T05:31:01,154][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990715] overhead, spent [989ms] collecting in the last [1.3s]
[2021-10-27T05:31:03,209][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990717][317576] duration [728ms], collections [1]/[1s], total [728ms]/[8.3h], memory [60.2gb]->[60.3gb]/[73.9gb], all_pools {[young] [260.4mb]->[1.1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [59.8gb]->[60.2gb]/[72.9gb]}
[2021-10-27T05:31:17,294][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990729] overhead, spent [1s] collecting in the last [1.4s]
[2021-10-27T05:31:43,554][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990753][317628] duration [711ms], collections [1]/[1s], total [711ms]/[8.3h], memory [70.4gb]->[70.5gb]/[73.9gb], all_pools {[young] [805.4mb]->[606mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [69.5gb]->[69.8gb]/[72.9gb]}
[2021-10-27T05:31:43,554][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990753] overhead, spent [711ms] collecting in the last [1s]
[2021-10-27T05:31:44,560][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990754][317629] duration [702ms], collections [1]/[1s], total [702ms]/[8.3h], memory [70.5gb]->[70.5gb]/[73.9gb], all_pools {[young] [606mb]->[417.3mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [69.8gb]->[70gb]/[72.9gb]}
[2021-10-27T05:31:48,350][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [1002-fsm_project_info-fsm_project_info_es-20201202145037][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [1002-fsm_project_info-fsm_project_info_es-20201202145037][1]: Recovery failed from {node-1}{JAkuqfrnQhWAVZMYQ7r_dw}{WbkQBrO2RumkTipaWtufzw}{xxxxxx}{xxxxxx:9311}{dim} into {node-0}{NbnZ28_aTbOrYieZNVA3tA}{5YmF6geTQ4-HLChonWGBOQ}{xxxxxxx}{xxxxxx:9311}{dim}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.1.jar:7.3.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_272]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_272]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_272]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-1][xxxxxxx:9311][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$23(RecoverySourceHandler.java:470) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.3.1.jar:7.3.1]
	... 7 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-0][xxxxxxx:9311][internal:index/shard/recovery/prepare_translog]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [76378064690/71.1gb], which is larger than the limit of [75418179993/70.2gb], real usage: [76378064376/71.1gb], new bytes reserved: [314/314b], usages [request=0/0b, fielddata=331758/323.9kb, in_flight_requests=2198/2.1kb, accounting=676895467/645.5mb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.1.jar:7.3.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1541) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1290) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1337) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	... 1 more
[2021-10-27T05:31:49,019][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [1002-fsm_project_info-fsm_project_info_es-20201202145037][1] marking and sending shard failed due to [failed recovery]

My guess is that the shard recovery check failed because the circuit breaker tripped; after more than the default of 5 allocation failures, the shard is marked as unavailable. At that point, however, the cluster status is still green, which I think is unreasonable.
Another accompanying symptom: in the result returned by the _cat/shards API, the shard's state is STARTED, but its document count and store size are empty.

1002-fsm_project_info-fsm_project_info_es-20201202145037 1 p STARTED                 xxxxxx  node-0
1002-fsm_project_info-fsm_project_info_es-20211102548881 0 p STARTED 5950102 21.4g   xxxxxx  node-0

1002-fsm_project_info-fsm_project_info_es-20201202145037 is abnormal.
1002-fsm_project_info-fsm_project_info_es-20211102548881 is normal.
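To cross-check what the master reports for those shard copies, and to retry the allocation once the heap pressure subsides, something along these lines should work (again assuming a localhost:9200 endpoint; the index and shard values below match the abnormal shard above):

# Shard-level health; a failed or unassigned copy should normally keep the cluster from reporting green
curl -s 'http://localhost:9200/_cluster/health?level=shards&pretty'

# Ask the allocator why this specific primary is where it is (or why it is not assigned)
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '{"index":"1002-fsm_project_info-fsm_project_info_es-20201202145037","shard":1,"primary":true}'

# Retry allocations that have exhausted index.allocation.max_retries (default 5)
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'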

7.3 is EOL and no longer supported; please upgrade as a first step.

Is this a known problem?
I also encountered it in version 7.10.2.

What is the output from the _cluster/stats?pretty&human API?
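For reference, something like this should fetch it (assuming the node listens on localhost:9200):

curl -s 'http://localhost:9200/_cluster/stats?pretty&human'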

The cluster status is green.
However, when querying the information for all shards, some of them are abnormal, for example:

1002-fsm_project_info-fsm_project_info_es-20201202145037 1 p STARTED                 xxxxxx  node-0
1002-fsm_project_info-fsm_project_info_es-20211102548881 0 p STARTED 5950102 21.4g   xxxxxx  node-0

In the cluster log from that time, there are entries showing that shard recovery for this index failed because the circuit breaker tripped.
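If it helps, the recovery attempts for that index can also be listed with the cat recovery API; a rough sketch against the same hypothetical localhost:9200 endpoint:

# Recovery attempts for the affected index, including stage and source/target nodes
curl -s 'http://localhost:9200/_cat/recovery/1002-fsm_project_info-fsm_project_info_es-20201202145037?v&h=index,shard,time,type,stage,source_node,target_node'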

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.