Multiple errors; "Failed to list store metadata for shard"

Hi all,
I have a problem with my Elasticsearch cluster: it is logging a large number of errors. At night, while an import job is running, I see many errors like the following:

[2019-01-15T03:28:54,372][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [XXX-stg-util] [116-XXX-2019-01-14_23_59_02][0]: failed to list shard for shard_store on node [dU9OHwkgTIS8vR1JXLHEaQ]
org.elasticsearch.action.FailedNodeException: Failed node [dU9OHwkgTIS8vR1JXLHEaQ]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:239) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.TcpTransport.lambda$handleException$16(TcpTransport.java:1467) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1465) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1457) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1401) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.5.0.jar:5.5.0]


Caused by: org.elasticsearch.transport.RemoteTransportException: [XXX-stg-app2][10.30.4.6:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[116-XXX-2019-01-14_23_59_02][0]]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:114) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.0.jar:5.5.0]

Caused by: java.io.FileNotFoundException: no segments* file found in store(mmapfs(/data/elastic/data/nodes/0/indices/I3ChJeMwRwevRF509mgYAw/0/index)): files: [recovery.AWhPVwK8S7hjY6TEdd7O._v.dii, recovery.AWhPVwK8S7hjY6TEdd7O._v.dim, recovery.AWhPVwK8S7hjY6TEdd7O._v.fdx, recovery.AWhPVwK8S7hjY6TEdd7O._v.fnm, recovery.AWhPVwK8S7hjY6TEdd7O._v.nvd, recovery.AWhPVwK8S7hjY6TEdd7O._v.nvm, recovery.AWhPVwK8S7hjY6TEdd7O._v.si, recovery.AWhPVwK8S7hjY6TEdd7O._v_Lucene50_0.doc, recovery.AWhPVwK8S7hjY6TEdd7O._v_Lucene50_0.pos, recovery.AWhPVwK8S7hjY6TEdd7O._v_Lucene50_0.tim, recovery.AWhPVwK8S7hjY6TEdd7O._v_Lucene50_0.tip, recovery.AWhPVwK8S7hjY6TEdd7O._v_Lucene54_0.dvd, recovery.AWhPVwK8S7hjY6TEdd7O._v_Lucene54_0.dvm, recovery.AWhPVwK8S7hjY6TEdd7O.segments_4, write.lock]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:687) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:644) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
        at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:450) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
        at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:129) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:199) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.index.store.Store.access$200(Store.java:127) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:818) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:751) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.index.store.Store.getMetadata(Store.java:273) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:874) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:128) ~[elasticsearch-5.5.0.jar:5.5.0]

Do you have any idea what the cause of this problem is?

Node overview (output of _cat/nodes?v):

ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.30.4.7           31          99   1    0.14    0.09     0.14 mdi       *      XXX-stg-util
10.30.4.6           49          99   1    0.01    0.05     0.05 mdi       -      XXX-stg-app2
10.30.4.4           69          99   0    0.00    0.03     0.05 mdi       -      XXX-stg-app1

XXX-cluster-stg-2019-01-10.log:77495
XXX-cluster-stg-2019-01-11.log:381894
XXX-cluster-stg-2019-01-12.log:95410
XXX-cluster-stg-2019-01-13.log:56012
XXX-cluster-stg-2019-01-14.log:256925
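
These per-day figures appear to be counts of the warning in each daily cluster log file; they could have been gathered with something like the following (the log path and grep pattern are assumed examples, not taken from the original post):

# Count occurrences of the warning per daily log file (prints file:count).
# /var/log/elasticsearch/ is an assumed example location.
grep -c "failed to list shard for shard_store" /var/log/elasticsearch/XXX-cluster-stg-*.log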

The stack trace above is truncated because the full message was too long.

This suggests a node crashed while it was performing a recovery. These warnings should stop being reported once all the shards have been allocated and your cluster health is green again.
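
To verify that, you can watch shard allocation and recovery with the standard cluster APIs; a minimal sketch, assuming a node is reachable at localhost:9200:

# Cluster health: "status" should return to green and "unassigned_shards" to 0.
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-shard state: look for shards still UNASSIGNED or INITIALIZING.
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'

# Ongoing and completed recoveries per shard.
curl -s 'http://localhost:9200/_cat/recovery?v'

# Explain why the first unassigned shard (if any) is not allocated (available since 5.0).
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'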
