Elasticsearch version (bin/elasticsearch --version): Version: 7.3.1
Plugins installed:
JVM version (java -version): JDK 8 (1.8.0_272, per the stack trace below)
OS version : Linux
Description of the problem including expected versus actual behavior:
Under heavy query load, the Elasticsearch node's heap usage keeps climbing and the JVM reclaims very little memory per collection. While the node is on the verge of a full GC, a shard migration (peer recovery) of the index starts. Because the parent circuit breaker has already tripped, the shard recovery fails.
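As a quick check (not part of the original report; host and port are placeholders), the parent breaker's live usage and limit can be read from the node stats API:

# Inspect circuit breaker usage per node; "parent" shows limit_size_in_bytes vs. estimated_size_in_bytes
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'

The 70.2gb limit in the exception below is consistent with the 7.x default of indices.breaker.total.limit = 95% of the heap when the real-memory breaker is enabled (0.95 * 73.9gb ≈ 70.2gb), so the parent breaker trips as soon as real heap usage stays above that line.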
[2021-10-27T05:30:09,194][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990669][317507] duration [833ms], collections [1]/[1s], total [833ms]/[8.3h], memory [41gb]->[40.9gb]/[73.9gb], all_pools {[young] [576.3mb]->[5.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [40.4gb]->[40.8gb]/[72.9gb]}
[2021-10-27T05:30:09,194][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990669] overhead, spent [833ms] collecting in the last [1s]
[2021-10-27T05:30:24,632][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990683][317527] duration [765ms], collections [1]/[1.1s], total [765ms]/[8.3h], memory [45.4gb]->[45.9gb]/[73.9gb], all_pools {[young] [25mb]->[538.8kb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [45.3gb]->[45.8gb]/[72.9gb]}
[2021-10-27T05:30:28,030][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990686] overhead, spent [981ms] collecting in the last [1.3s]
[2021-10-27T05:30:46,970][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990702] overhead, spent [1s] collecting in the last [1.4s]
[2021-10-27T05:30:48,050][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990703][317558] duration [797ms], collections [1]/[1s], total [797ms]/[8.3h], memory [54.6gb]->[54.9gb]/[73.9gb], all_pools {[young] [207.4mb]->[860.6kb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [54.3gb]->[54.8gb]/[72.9gb]}
[2021-10-27T05:30:48,050][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990703] overhead, spent [797ms] collecting in the last [1s]
[2021-10-27T05:30:51,257][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990706][317561] duration [789ms], collections [1]/[1s], total [789ms]/[8.3h], memory [55.9gb]->[56gb]/[73.9gb], all_pools {[young] [332.9mb]->[1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [55.4gb]->[55.9gb]/[72.9gb]}
[2021-10-27T05:30:51,257][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990706] overhead, spent [789ms] collecting in the last [1s]
[2021-10-27T05:31:01,154][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990715] overhead, spent [989ms] collecting in the last [1.3s]
[2021-10-27T05:31:03,209][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990717][317576] duration [728ms], collections [1]/[1s], total [728ms]/[8.3h], memory [60.2gb]->[60.3gb]/[73.9gb], all_pools {[young] [260.4mb]->[1.1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [59.8gb]->[60.2gb]/[72.9gb]}
[2021-10-27T05:31:17,294][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990729] overhead, spent [1s] collecting in the last [1.4s]
[2021-10-27T05:31:43,554][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990753][317628] duration [711ms], collections [1]/[1s], total [711ms]/[8.3h], memory [70.4gb]->[70.5gb]/[73.9gb], all_pools {[young] [805.4mb]->[606mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [69.5gb]->[69.8gb]/[72.9gb]}
[2021-10-27T05:31:43,554][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][3990753] overhead, spent [711ms] collecting in the last [1s]
[2021-10-27T05:31:44,560][INFO ][o.e.m.j.JvmGcMonitorService] [node-0] [gc][young][3990754][317629] duration [702ms], collections [1]/[1s], total [702ms]/[8.3h], memory [70.5gb]->[70.5gb]/[73.9gb], all_pools {[young] [606mb]->[417.3mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [69.8gb]->[70gb]/[72.9gb]}
[2021-10-27T05:31:48,350][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [1002-fsm_project_info-fsm_project_info_es-20201202145037][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [1002-fsm_project_info-fsm_project_info_es-20201202145037][1]: Recovery failed from {node-1}{JAkuqfrnQhWAVZMYQ7r_dw}{WbkQBrO2RumkTipaWtufzw}{xxxxxx}{xxxxxx:9311}{dim} into {node-0}{NbnZ28_aTbOrYieZNVA3tA}{5YmF6geTQ4-HLChonWGBOQ}{xxxxxxx}{xxxxxx:9311}{dim}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.1.jar:7.3.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_272]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_272]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_272]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-1][xxxxxxx:9311][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$23(RecoverySourceHandler.java:470) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.3.1.jar:7.3.1]
... 7 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-0][xxxxxxx:9311][internal:index/shard/recovery/prepare_translog]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [76378064690/71.1gb], which is larger than the limit of [75418179993/70.2gb], real usage: [76378064376/71.1gb], new bytes reserved: [314/314b], usages [request=0/0b, fielddata=331758/323.9kb, in_flight_requests=2198/2.1kb, accounting=676895467/645.5mb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1541) ~[?:?]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1290) ~[?:?]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1337) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
... 1 more
[2021-10-27T05:31:49,019][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [1002-fsm_project_info-fsm_project_info_es-20201202145037][1] marking and sending shard failed due to [failed recovery]
I speculate that the recovery failed because the circuit breaker had tripped, and that after exceeding the default of 5 failed allocation attempts (index.allocation.max_retries) the shard is marked as unavailable and no longer retried. However, at this point the cluster status is still green, which I think is unreasonable.
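One way to confirm this (hypothetical commands, not from the report; the host is a placeholder) is to ask why the shard is unassigned and, once heap pressure subsides, retry the exhausted allocation attempts:

# Explain the allocation decision for an unassigned shard, including the accumulated failure count
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'
# Retry shards whose index.allocation.max_retries budget is exhausted
curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'

In this case, though, the shard is reported as STARTED, so the inconsistency is precisely that these APIs may not show it as failed at all.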
An accompanying phenomenon: in the result returned by the _cat/shards API, the shard's state is STARTED, but its doc count and store size are empty.
1002-fsm_project_info-fsm_project_info_es-20201202145037 1 p STARTED xxxxxx node-0
1002-fsm_project_info-fsm_project_info_es-20211102548881 0 p STARTED 5950102 21.4g xxxxxx node-0
1002-fsm_project_info-fsm_project_info_es-20201202145037 is abnormal.
1002-fsm_project_info-fsm_project_info_es-20211102548881 is normal.
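For reference, the output above can be reproduced with an explicit column list (a sketch; the index pattern is an assumption matching the names above):

# v adds a header row; h selects the columns shown in the output above
curl -s 'http://localhost:9200/_cat/shards/1002-fsm_project_info-*?v&h=index,shard,prirep,state,docs,store,node'

For the abnormal shard, the docs and store columns come back empty even though state is STARTED.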