Dear All,
I had posted the below issue while allocating replicas last week, and not got any reply for it yet.
Maybe my questions were too convoluted or had a formatting issue to get an answer..
So I have arranged the contents and posted again,
Please check this issue and give any feedback to solve it.
Cluster ENV
- es version : 6.5.2
- Cluster has total 6 nodes(one node per one instance) and the master node is "
node-01
" now.
node-01
:C-4CZFpTTR6dRZxrobsBfg
node-02
:sMqz3F_BTFGU1rDmHSXHpQ
node-03
:vvfQweLFTsq78uqmC_ATRQ
node-04
:thoqSv1qSo64Z4AjPMI5uQ
node-05
:E-uK6wdYS6yJa7BDbjWl-w
node-06
:TypY23nHSdeYtHaJhod6jQ
I have created a new index of only 1 shard(with no replica) and inserted about 600,000 docs through bulk api.
(The new index name is "es_sch_idx
" and allocated in "node-05
")
Now index size is 1.9GB and I have watied until all segments are commited.
And then I set replica to "5" and the master node starts to allocate the shard to all nodes through recovery action.
[2020-04-06T01:47:53,928][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node-01] updating number_of_replicas to [5] for indices [es_sch_idx]
The replica has been made into 2 nodes("node-02
", "node-03
") first since "cluster.routing.allocation.node_incoming/outgoing_recoveries" are the default value of 2.
The issue is happened when the recovery for first 2 replicas has done or the start time for the third replica.
The below logs are made too many and the master node sometimes hang out for some minutes.
[2020-04-06T01:49:29,982][WARN ][o.e.g.G.InternalReplicaShardAllocator] [node-01] [es_sch_idx][0]: failed to list shard for shard_store on node [sMqz3F_BTFGU1rDmHSXHpQ]
org.elasticsearch.action.FailedNodeException: Failed node [sMqz3F_BTFGU1rDmHSXHpQ]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.TcpTransport.lambda$handleException$32(TcpTransport.java:1268) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1266) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1258) [elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1188) [elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:65) [transport-netty4-client-6.5.2.jar:6.5.2]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:426) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-02][10.113.132.11:1039][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[es_sch_idx][0]]
at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:113) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.2.jar:6.5.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
... 1 more
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/essch/data/nodes/0/indices/NRk9I_LsQ4aECjJrPryKxw/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@992f9f9)): files: [recovery.WhKUaGWERk-JYuHB1JpNow._0.cfe, recovery.WhKUaGWERk-JYuHB1JpNow._0.cfs, ... , recovery.WhKUaGWERk-JYuHB1JpNow._y.si, recovery.WhKUaGWERk-JYuHB1JpNow._y_1.liv, recovery.WhKUaGWERk-JYuHB1JpNow.segments_68, write.lock]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:131) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:201) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.index.store.Store.access$200(Store.java:129) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:851) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:784) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.index.store.Store.getMetadata(Store.java:287) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1176) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:127) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:111) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.2.jar:6.5.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
... 1 more
... 1 more
FYI,
I have not found any problem when I set replica to "1" or "2".
This issue is happened when I set replicas "3" or more..
If you need more information, I can add it since this issue can be reproduced very easily..
Thanks!!