Question about a "FileNotFoundException" problem while allocating replicas

Dear All,

I posted the issue below about allocating replicas last week, but have not received any reply yet.

Maybe my questions were too convoluted, or a formatting problem kept them from getting an answer.
So I have cleaned up the content and am posting it again.
Please take a look at this issue and share any feedback that might help solve it.

Cluster ENV

  • ES version: 6.5.2
  • The cluster has 6 nodes in total (one node per instance), and the current master node is "node-01":
    node-01 : C-4CZFpTTR6dRZxrobsBfg
    node-02 : sMqz3F_BTFGU1rDmHSXHpQ
    node-03 : vvfQweLFTsq78uqmC_ATRQ
    node-04 : thoqSv1qSo64Z4AjPMI5uQ
    node-05 : E-uK6wdYS6yJa7BDbjWl-w
    node-06 : TypY23nHSdeYtHaJhod6jQ

I created a new index with only 1 shard (and no replica) and inserted about 600,000 docs through the bulk API.
(The new index is named "es_sch_idx" and is allocated on "node-05".)
The index size is now 1.9GB, and I waited until all segments were committed.
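For reference, the calls looked roughly like the sketch below (the host/port and the document fields are placeholders; the real bulk request contained about 600,000 docs):

  curl -XPUT 'http://localhost:9200/es_sch_idx' -H 'Content-Type: application/json' -d '
  {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }'

  # bulk.ndjson: one action line + one source line per document, e.g.
  #   { "index": { "_index": "es_sch_idx", "_type": "_doc" } }
  #   { "field1": "value1" }
  curl -XPOST 'http://localhost:9200/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson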

Then I set the replica count to "5", and the master node started allocating the shard to all the other nodes through recovery.
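The settings update was roughly this (host is a placeholder):

  curl -XPUT 'http://localhost:9200/es_sch_idx/_settings' -H 'Content-Type: application/json' -d '
  {
    "index": {
      "number_of_replicas": 5
    }
  }'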

[2020-04-06T01:47:53,928][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node-01] updating number_of_replicas to [5] for indices [es_sch_idx]

The replicas were first created on 2 nodes ("node-02", "node-03"), since "cluster.routing.allocation.node_concurrent_incoming_recoveries" and "cluster.routing.allocation.node_concurrent_outgoing_recoveries" are at their default value of 2.
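I did not change those settings for this test; they can be inspected or adjusted dynamically, for example (host is a placeholder):

  # show effective values, including defaults
  curl -XGET 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty'

  # raise/lower the limit for both incoming and outgoing recoveries per node
  curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
  {
    "transient": {
      "cluster.routing.allocation.node_concurrent_recoveries": 2
    }
  }'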

The issue happens when recovery of the first 2 replicas has finished, or around the time recovery of the third replica starts.
The logs below are emitted in very large numbers, and the master node sometimes hangs for several minutes.

[2020-04-06T01:49:29,982][WARN ][o.e.g.G.InternalReplicaShardAllocator] [node-01] [es_sch_idx][0]: failed to list shard for shard_store on node [sMqz3F_BTFGU1rDmHSXHpQ]
org.elasticsearch.action.FailedNodeException: Failed node [sMqz3F_BTFGU1rDmHSXHpQ]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.lambda$handleException$32(TcpTransport.java:1268) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1266) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1258) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1188) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:65) [transport-netty4-client-6.5.2.jar:6.5.2]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:426) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.30.Final.jar:4.1.30.Final]
	...
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-02][10.113.132.11:1039][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[es_sch_idx][0]]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:113) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.2.jar:6.5.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
	... 1 more
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/essch/data/nodes/0/indices/NRk9I_LsQ4aECjJrPryKxw/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@992f9f9)): files: [recovery.WhKUaGWERk-JYuHB1JpNow._0.cfe, recovery.WhKUaGWERk-JYuHB1JpNow._0.cfs, ... , recovery.WhKUaGWERk-JYuHB1JpNow._y.si, recovery.WhKUaGWERk-JYuHB1JpNow._y_1.liv, recovery.WhKUaGWERk-JYuHB1JpNow.segments_68, write.lock]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:131) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:201) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store.access$200(Store.java:129) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:851) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:784) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.store.Store.getMetadata(Store.java:287) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1176) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:127) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:111) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.2.jar:6.5.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
	... 1 more
	... 1 more

FYI,
I have not seen any problem when I set the replica count to "1" or "2".
The issue only happens when I set it to "3" or more.
If you need more information, I can add it, since the issue can be reproduced very easily.
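For anyone trying to reproduce this, recovery progress and shard placement can be watched with something like (host is a placeholder):

  # per-shard recovery progress (source/target node, stage, bytes)
  curl -XGET 'http://localhost:9200/_cat/recovery/es_sch_idx?v'

  # current allocation of every copy of the shard
  curl -XGET 'http://localhost:9200/_cat/shards/es_sch_idx?v'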

Thanks!!

I think this is https://github.com/elastic/elasticsearch/pull/42287; the warnings are actually harmless, and you can get rid of them by upgrading.
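If upgrading is not an option right away, a possible stopgap (untested, and assuming the warning is emitted by a logger under the org.elasticsearch.gateway package, as the log line suggests) would be to raise that logger's level dynamically:

  curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
  {
    "transient": {
      "logger.org.elasticsearch.gateway": "ERROR"
    }
  }'

Note that this would also hide any other WARN-level messages from that package, so it is only a temporary measure.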

Thank you very much for your reply!!

I have checked the patch you mentioned.
It seems to remove the warning log and emit a trace log instead when Lucene throws an IndexNotFoundException.

I have one more question.
When a large number of those warning logs are emitted, the master node sometimes hangs for several minutes.
Can such a large volume of warning logs cause the Elasticsearch process on the master node to hang?

Thank you for your help!

It is unlikely.

Thank you for the information.
We are investigating the master node hang further.
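For reference, one way to see what the master is busy with during such a hang is something like this (host is a placeholder):

  # what the elected master node's threads are busy with
  curl -XGET 'http://localhost:9200/_nodes/_master/hot_threads?threads=500'

  # whether cluster-state update tasks are queuing up on the master
  curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty'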

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.