Question about a "FileNotFoundException" problem while allocating replicas

Dear All,

I posted the issue below about allocating replicas last week, but have not received any reply yet.

Maybe my questions were too convoluted, or a formatting problem kept them from getting an answer.
So I have cleaned up the content and am posting it again.
Please take a look at this issue and share any feedback that might help solve it.

Cluster ENV

  • ES version: 6.5.2
  • The cluster has 6 nodes in total (one node per instance), and the current master node is "node-01":
    node-01 : C-4CZFpTTR6dRZxrobsBfg
    node-02 : sMqz3F_BTFGU1rDmHSXHpQ
    node-03 : vvfQweLFTsq78uqmC_ATRQ
    node-04 : thoqSv1qSo64Z4AjPMI5uQ
    node-05 : E-uK6wdYS6yJa7BDbjWl-w
    node-06 : TypY23nHSdeYtHaJhod6jQ

I created a new index with only 1 shard (and no replica) and inserted about 600,000 docs through the bulk API.
(The new index is named "es_sch_idx" and is allocated on "node-05".)
The index size is now 1.9GB, and I waited until all segments were committed.
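For reference, the calls looked roughly like the sketch below (the host/port and the document fields are placeholders; the real bulk request contained about 600,000 docs):

  curl -XPUT 'http://localhost:9200/es_sch_idx' -H 'Content-Type: application/json' -d '
  {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }'

  # bulk.ndjson: one action line + one source line per document, e.g.
  #   { "index": { "_index": "es_sch_idx", "_type": "_doc" } }
  #   { "field1": "value1" }
  curl -XPOST 'http://localhost:9200/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson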

Then I set the replica count to "5", and the master node started allocating the shard to all the other nodes through recovery.
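The settings update was roughly this (host is a placeholder):

  curl -XPUT 'http://localhost:9200/es_sch_idx/_settings' -H 'Content-Type: application/json' -d '
  {
    "index": {
      "number_of_replicas": 5
    }
  }'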

[2020-04-06T01:47:53,928][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node-01] updating number_of_replicas to [5] for indices [es_sch_idx]

The replicas were first created on 2 nodes ("node-02", "node-03"), since "cluster.routing.allocation.node_concurrent_incoming_recoveries" and "cluster.routing.allocation.node_concurrent_outgoing_recoveries" are at their default value of 2.
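I did not change those settings for this test; they can be inspected or adjusted dynamically, for example (host is a placeholder):

  # show effective values, including defaults
  curl -XGET 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty'

  # raise/lower the limit for both incoming and outgoing recoveries per node
  curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
  {
    "transient": {
      "cluster.routing.allocation.node_concurrent_recoveries": 2
    }
  }'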

The issue happens when recovery of the first 2 replicas has finished, or around the time recovery of the third replica starts.
The logs below are emitted in very large numbers, and the master node sometimes hangs for several minutes.

[2020-04-06T01:49:29,982][WARN ][o.e.g.G.InternalReplicaShardAllocator] [node-01] [es_sch_idx][0]: failed to list shard for shard_store on node [sMqz3F_BTFGU1rDmHSXHpQ]
org.elasticsearch.action.FailedNodeException: Failed node [sMqz3F_BTFGU1rDmHSXHpQ]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.lambda$handleException$32(TcpTransport.java:1268) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1266) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1258) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1188) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:65) [transport-netty4-client-6.5.2.jar:6.5.2]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:426) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	...
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-02][10.113.132.11:1039][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[es_sch_idx][0]]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:113) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.2.jar:6.5.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
	... 1 more
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/essch/data/nodes/0/indices/NRk9I_LsQ4aECjJrPryKxw/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@992f9f9)): files: [recovery.WhKUaGWERk-JYuHB1JpNow._0.cfe, recovery.WhKUaGWERk-JYuHB1JpNow._0.cfs, ... , recovery.WhKUaGWERk-JYuHB1JpNow._y.si, recovery.WhKUaGWERk-JYuHB1JpNow._y_1.liv, recovery.WhKUaGWERk-JYuHB1JpNow.segments_68, write.lock]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:131) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:201) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store.access$200(Store.java:129) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:851) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:784) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store.getMetadata(Store.java:287) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1176) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:127) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:111) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
	... 1 more
	... 1 more

FYI,
I have not seen any problem when I set the replica count to "1" or "2".
The issue only happens when I set it to "3" or more.
If you need more information, I can add it, since the issue can be reproduced very easily.
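For anyone trying to reproduce this, recovery progress and shard placement can be watched with something like (host is a placeholder):

  # per-shard recovery progress (source/target node, stage, bytes)
  curl -XGET 'http://localhost:9200/_cat/recovery/es_sch_idx?v'

  # current allocation of every copy of the shard
  curl -XGET 'http://localhost:9200/_cat/shards/es_sch_idx?v'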

Thanks!!

I think this is https://github.com/elastic/elasticsearch/pull/42287; the warnings are actually harmless, and you can get rid of them by upgrading.
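If upgrading is not an option right away, a possible stopgap (untested, and assuming the warning is emitted by a logger under the org.elasticsearch.gateway package, as the log line suggests) would be to raise that logger's level dynamically:

  curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
  {
    "transient": {
      "logger.org.elasticsearch.gateway": "ERROR"
    }
  }'

Note that this would also hide any other WARN-level messages from that package, so it is only a temporary measure.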

Thank you very much for your reply!!

I have checked the patch you mentioned.
It seems to remove the warning log and emit a trace log instead when Lucene throws an IndexNotFoundException.

I have one more question.
When a large number of those warning logs are emitted, the master node sometimes hangs for several minutes.
Can such a large volume of warning logs cause the Elasticsearch process on the master node to hang?

Thank you for your help!

It is unlikely.

Thank you for the information.
We are investigating the master node hang further.
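For reference, one way to see what the master is busy with during such a hang is something like this (host is a placeholder):

  # what the elected master node's threads are busy with
  curl -XGET 'http://localhost:9200/_nodes/_master/hot_threads?threads=500'

  # whether cluster-state update tasks are queuing up on the master
  curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty'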

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.