Failed to list shard for shard_store on node on big environments

kley · May 6, 2019, 11:05am

Hi!

Before I ask my question, I'll describe what the environment looks like:
We have a test environment with 6 Elasticsearch nodes. Indices are set to two replicas each. Data-wise, we have ~5TB of data on each node.

We recently updated Elasticsearch to version 6.2.x, and since then we see the following error pretty often on environments like this:

[2019-05-02T16:16:14,726][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [192.168.1.5] [visit-global-daily35-d2019.05.01][1]: failed to list shard for shard_store on node [XkMhhclsSVa8lEaMm4YbJA]
org.elasticsearch.action.FailedNodeException: Failed node [XkMhhclsSVa8lEaMm4YbJA]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:239) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1098) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.TcpTransport.lambda$handleException$33(TcpTransport.java:1478) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1476) [elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1468) [elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1398) [elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) [transport-netty4-6.2.4.jar:6.2.4]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]

I noticed that this error occurs only when we restart the nodes.
My questions now are:

First, why does this message spam the logfile like this? We get this message a couple of times per second.
Second, regarding this answer (Multiple errors; "Failed to list store metadata for shard"): this error occurs when a node crashed while recovering. How can I prevent a node from crashing?
Third, why does this log only occur on nodes with a huge amount of data?

I hope I provided you with all information needed - if not, just ask

thanks in advance!

kley · May 6, 2019, 11:29am

Small addition: I just found out that some of the indices have shards with 100GB+ of data stored, which is not optimal. Could something like this be the root cause?

DavidTurner · May 6, 2019, 1:31pm

Is that the whole log message or are there Caused by lines too? I think there should be inner exceptions and without them it's pretty hard to say much more than "something went wrong on this node".

kley · May 6, 2019, 1:39pm

Ah sorry, I had to cut the exception short because of the 7000 character limit

Here we go (the second half of the message):

Caused by: org.elasticsearch.transport.RemoteTransportException: [192.168.1.4][192.168.1.4:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[visit-global-daily35-d2019.04.28][0]]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:111) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
	... 1 more
Caused by: java.io.FileNotFoundException: no segments* file found in store(MMapDirectory@/srv/session/elasticsearch/nodes/0/indices/6sPEl_HKSfu97w2rMklQgg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2825f932): files: [recovery.02Nsyk8TREClSvQxt93PXg._522.dii, recovery.02Nsyk8TREClSvQxt93PXg._522.dim, recovery.02Nsyk8TREClSvQxt93PXg._522.fdt, recovery.02Nsyk8TREClSvQxt93PXg._522.fdx, recovery.02Nsyk8TREClSvQxt93PXg._522.fnm, recovery.02Nsyk8TREClSvQxt93PXg._522.si, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.doc, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.pos, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.tim, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.tip, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene70_0.dvd, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene70_0.dvm, recovery.02Nsyk8TREClSvQxt93PXg.segments_72, write.lock]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:670) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:627) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:434) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:123) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:202) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.index.store.Store.access$200(Store.java:130) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:859) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:792) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.index.store.Store.getMetadata(Store.java:288) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1143) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:125) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:109) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
	... 1 more

DavidTurner · May 6, 2019, 2:14pm

Thanks, ok, this does indeed look the same as the post you linked:

kley:

Caused by: java.io.FileNotFoundException: no segments* file found in store(MMapDirectory@/srv/session/elasticsearch/nodes/0/indices/6sPEl_HKSfu97w2rMklQgg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2825f932): files: [recovery.02Nsyk8TREClSvQxt93PXg._522.dii, recovery.02Nsyk8TREClSvQxt93PXg._522.dim, recovery.02Nsyk8TREClSvQxt93PXg._522.fdt, recovery.02Nsyk8TREClSvQxt93PXg._522.fdx, recovery.02Nsyk8TREClSvQxt93PXg._522.fnm, recovery.02Nsyk8TREClSvQxt93PXg._522.si, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.doc, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.pos, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.tim, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene50_0.tip, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene70_0.dvd, recovery.02Nsyk8TREClSvQxt93PXg._522_Lucene70_0.dvm, recovery.02Nsyk8TREClSvQxt93PXg.segments_72, write.lock]

Something stopped this shard copy mid-recovery and prevented it from cleaning up after itself, leaving some unexpected files in the directory. Elasticsearch leaves these files alone until the shard is fully allocated, and only then will it delete them. Arguably this state could perhaps be treated as less "unexpected" and handled more gracefully, but the only real benefit to this would be quieter logs.
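To make the failure concrete: Lucene looks for a commit file whose name starts with `segments` (for example `segments_72`), and in the listing above every index file still carries the temporary `recovery.<id>.` prefix, so no commit is found. A minimal Python sketch of that check, using a few file names from the exception; the helper names `has_commit_point` and `temporary_recovery_files` are made up for illustration and are not Elasticsearch or Lucene code:

```python
def has_commit_point(files):
    """Return True if any file name looks like a Lucene commit file (segments_N)."""
    return any(name.startswith("segments_") for name in files)


def temporary_recovery_files(files):
    """Return the names that still carry the temporary 'recovery.' prefix."""
    return [name for name in files if name.startswith("recovery.")]


# A few of the file names reported in the FileNotFoundException above.
files = [
    "recovery.02Nsyk8TREClSvQxt93PXg._522.fdt",
    "recovery.02Nsyk8TREClSvQxt93PXg._522.si",
    "recovery.02Nsyk8TREClSvQxt93PXg.segments_72",
    "write.lock",
]

print(has_commit_point(files))               # False
print(len(temporary_recovery_files(files)))  # 3
```

The `segments_72` file is present, but only under its temporary name, which is why the store is reported as having no `segments*` file.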

The action that is failing is something that Elasticsearch retries until the shard is fully allocated.

Nodes can crash for lots of reasons: power loss, hardware failure, operator error, misconfiguration, etc. Node resilience is a broad topic and improvements are an active area of development. Without knowing how the node crashed it's hard to give a more specific answer. There should be information in the logs about this. (Strictly speaking it's possible that the cleanup failed without the node crashing, but I think this would still leave information in the logs).

I would guess that recoveries of a large amount of data take correspondingly longer, so the probability of any event occurring during such a recovery is higher. The failure could possibly be related to the recovery of a large shard too; it's hard to know without more information.

100GB might not be unreasonable for a single shard, depending on your use case. We've seen much larger ones used successfully. Generally we see more stability problems with too many too-small shards rather than too few too-large ones.
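If it helps to see how shard sizes are distributed, one option is to pull them from `GET _cat/shards?format=json&bytes=gb` and filter. The following is only a sketch: it assumes the response rows carry `index`, `shard`, `prirep` and `store` fields with `store` expressed in GB, the function name `large_shards` is hypothetical, and the sample data is invented for illustration.

```python
import json


def large_shards(cat_shards_json, threshold_gb=100):
    """Return (index, shard, prirep, size_gb) for shards at or above the threshold.

    Assumes the input is the JSON text returned by
    GET _cat/shards?format=json&bytes=gb, where "store" is a size in GB.
    """
    result = []
    for row in json.loads(cat_shards_json):
        store = row.get("store")
        if store is None:  # shards without a store size (e.g. unassigned) are skipped
            continue
        size_gb = float(store)
        if size_gb >= threshold_gb:
            result.append((row["index"], row["shard"], row["prirep"], size_gb))
    # Largest shards first.
    return sorted(result, key=lambda r: r[3], reverse=True)


# Hypothetical sample response; the sizes are invented for illustration.
sample = json.dumps([
    {"index": "visit-global-daily35-d2019.05.01", "shard": "1", "prirep": "p", "store": "120"},
    {"index": "visit-global-daily35-d2019.05.01", "shard": "1", "prirep": "r", "store": None},
    {"index": "visit-global-daily35-d2019.05.02", "shard": "0", "prirep": "p", "store": "35"},
])

for index, shard, prirep, size_gb in large_shards(sample):
    print(index, shard, prirep, size_gb)
```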

Can you look back at the history of this node and determine what happened at around the time you started getting these messages? Was it shut down in a strange way for instance?
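One way to do that lookup is to find the first occurrence of the warning in the node's log and list what was logged shortly before it. The sketch below assumes the default `[timestamp][LEVEL][logger] [node] message` line layout seen in this thread; the function names and the two-hour window are arbitrary choices for illustration, and stack-trace lines are simply skipped.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as: [2019-05-02T14:56:15,349][INFO ][o.e.n.Node   ] [192.168.1.5] stopping ...
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]"
    r"\s*\[(?P<node>[^\]]+)\]\s*(?P<msg>.*)$"
)


def parse(line):
    """Return (timestamp, level, logger, message), or None for non-header lines."""
    m = LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
    return ts, m.group("level"), m.group("logger"), m.group("msg")


def context_before_first(lines, needle, window=timedelta(minutes=10)):
    """Find the first entry containing `needle` and the entries logged in `window` before it."""
    entries = [e for e in map(parse, lines) if e is not None]
    first = next((e for e in entries if needle in e[3]), None)
    if first is None:
        return None, []
    start = first[0] - window
    return first[0], [e for e in entries if start <= e[0] < first[0]]


# Log lines taken from this thread.
sample_lines = [
    "[2019-05-02T14:52:45,846][INFO ][o.e.c.s.ClusterSettings  ] [192.168.1.5] updating [cluster.routing.allocation.enable] from [all] to [none]",
    "[2019-05-02T14:56:15,349][INFO ][o.e.n.Node               ] [192.168.1.5] stopping ...",
    "[2019-05-02T16:16:14,726][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [192.168.1.5] [visit-global-daily35-d2019.05.01][1]: failed to list shard for shard_store on node [XkMhhclsSVa8lEaMm4YbJA]",
    "org.elasticsearch.action.FailedNodeException: Failed node [XkMhhclsSVa8lEaMm4YbJA]",
]

first_seen, context = context_before_first(
    sample_lines, "failed to list shard", window=timedelta(hours=2)
)
print("first seen:", first_seen)
for ts, level, logger, msg in context:
    print(ts, level, logger, msg)
```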

kley · May 7, 2019, 9:31am

Thanks David for the response! I'll have to look deeper into some messages but I'll try to list what happened.
So, whenever we restart an Elasticsearch node we disable shard allocation (updating [cluster.routing.allocation.enable] from [all] to [none]) and stop Elasticsearch afterwards. In this case we got some warnings which are expected:

[2019-05-02T14:56:15,425][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02-16][0]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02-16][0], node[g78Dg5QeTqytaMNr5bgQ6Q], [R], s[STARTED], a[id=msVXSk_JSvy8VsnjuIP6Cg]
org.elasticsearch.transport.NodeNotConnectedException: [192.168.1.6][192.168.1.6:9300] Node not connected

and we got some warnings about writing bulks on replicas:

[2019-05-02T14:56:15,419][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02][1]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02][1], node[XkMhhclsSVa8lEaMm4YbJA], [R], s[STARTED], a[id=HqNoHYh5SKy3_Z9ZvIW5lA]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.4][192.168.1.4:9300][indices:data/write/bulk[s][r]] disconnected
...
[2019-05-02T14:56:15,496][WARN ][o.e.c.a.s.ShardStateAction] [192.168.1.5] [visit-global-daily35-d2019.05.02-4][0] node closed while execution action [internal:cluster/shard/failure] for shard entry [shard id [[visit-global-daily35-d2019.05.02-4][0]], allocation id [ldyqt-u0SvKpj2-5JlU25Q], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02-4][0], node[g78Dg5QeTqytaMNr5bgQ6Q], [R], s[STARTED], a[id=ldyqt-u0SvKpj2-5JlU25Q]], failure [NodeDisconnectedException[[192.168.1.6][192.168.1.6:9300][indices:data/write/bulk[s][r]] disconnected]]]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.6][192.168.1.6:9300][indices:data/write/bulk[s][r]] disconnected
...
[2019-05-02T14:56:15,501][WARN ][o.e.t.n.Netty4Transport  ] [192.168.1.5] send message failed [channel: NettyTcpChannel{localAddress=/192.168.1.5:9300, remoteAddress=/192.168.1.10:45222}]
org.elasticsearch.transport.TransportException: Cannot send message, event loop is shutting down.

Which, as I understand it, are OK because the cluster is shutting down.

After that the server starts up with some common warnings like "no known master" or "failed to connect to node"...

The cluster switches to red and schedules reroutes:

[2019-05-02T15:40:17,763][INFO ][o.e.c.r.a.AllocationService] [192.168.1.5] Cluster health status changed from [YELLOW] to [RED] (reason: [{192.168.1.9}{VTAbbKBuR-qNHMSyCnrO2Q}{sbDqadJ2Sz2pnuOeAhBwkQ}{192.168.1.9}{192.168.1.9:9300} transport disconnected]).
[2019-05-02T15:40:17,764][INFO ][o.e.c.s.MasterService    ] [192.168.1.5] zen-disco-node-failed({192.168.1.9}{VTAbbKBuR-qNHMSyCnrO2Q}{sbDqadJ2Sz2pnuOeAhBwkQ}{192.168.1.9}{192.168.1.9:9300}), reason(transport disconnected)[{192.168.1.9}{VTAbbKBuR-qNHMSyCnrO2Q}{sbDqadJ2Sz2pnuOeAhBwkQ}{192.168.1.9}{192.168.1.9:9300} transport disconnected], reason: removed {{192.168.1.9}{VTAbbKBuR-qNHMSyCnrO2Q}{sbDqadJ2Sz2pnuOeAhBwkQ}{192.168.1.9}{192.168.1.9:9300},}
[2019-05-02T15:40:18,607][INFO ][o.e.c.s.ClusterApplierService] [192.168.1.5] removed {{192.168.1.9}{VTAbbKBuR-qNHMSyCnrO2Q}{sbDqadJ2Sz2pnuOeAhBwkQ}{192.168.1.9}{192.168.1.9:9300},}, reason: apply cluster state (from master [master {192.168.1.5}{B1Rgj8yoQuWjh6IKxvlHUg}{uznnrIPfSuaJQG44I-oVDg}{192.168.1.5}{192.168.1.5:9300} committed version [10231] source [zen-disco-node-failed({192.168.1.9}{VTAbbKBuR-qNHMSyCnrO2Q}{sbDqadJ2Sz2pnuOeAhBwkQ}{192.168.1.9}{192.168.1.9:9300}), reason(transport disconnected)[{192.168.1.9}{VTAbbKBuR-qNHMSyCnrO2Q}{sbDqadJ2Sz2pnuOeAhBwkQ}{192.168.1.9}{192.168.1.9:9300} transport disconnected]]])
[2019-05-02T15:40:19,003][INFO ][o.e.c.r.DelayedAllocationService] [192.168.1.5] scheduling reroute for delayed shards in [58.5s] (428 delayed shards)
[2019-05-02T15:40:19,020][WARN ][o.e.c.a.s.ShardStateAction] [192.168.1.5] [visit-global-daily35-d2019.05.02-1024][0] received shard failed for shard id [[visit-global-daily35-d2019.05.02-1024][0]], allocation id [a1ch1cCfRiawepbFUiJnDg], primary term [2], message [mark copy as stale]
[2019-05-02T15:41:17,602][INFO ][o.e.c.r.DelayedAllocationService] [192.168.1.5] scheduling reroute for delayed shards in [12.7s] (325 delayed shards)
[2019-05-02T15:41:30,381][INFO ][o.e.c.r.DelayedAllocationService] [192.168.1.5] scheduling reroute for delayed shards in [0s] (306 delayed shards)
[2019-05-02T15:41:30,444][INFO ][o.e.c.r.DelayedAllocationService] [192.168.1.5] scheduling reroute for delayed shards in [6.8s] (291 delayed shards)
[2019-05-02T15:41:37,325][INFO ][o.e.c.r.DelayedAllocationService] [192.168.1.5] scheduling reroute for delayed shards in [651.3ms] (290 delayed shards)
[2019-05-02T15:41:38,026][INFO ][o.e.c.r.DelayedAllocationService] [192.168.1.5] scheduling reroute for delayed shards in [64.9ms] (289 delayed shards)
[2019-05-02T15:41:38,147][INFO ][o.e.c.r.DelayedAllocationService] [192.168.1.5] scheduling reroute for delayed shards in [52.1s] (288 delayed shards)

At some point we turn allocation back on (updating [cluster.routing.allocation.enable] from [none] to [all]) and right after that the shard_store exception kicks in.
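For reference, the allocation toggle around a restart can be expressed against the cluster settings API roughly as follows. This is a sketch under assumptions: the address `http://localhost:9200` is a placeholder, the function name `allocation_request` is made up, and using a `transient` setting is one possible choice rather than a description of our actual tooling.

```python
import json
import urllib.request


def allocation_request(base_url, mode):
    """Build a PUT _cluster/settings request that sets cluster.routing.allocation.enable."""
    if mode not in ("all", "primaries", "new_primaries", "none"):
        raise ValueError("unsupported allocation mode: %r" % mode)
    # Assumption: a transient setting; a persistent one would use the "persistent" key.
    body = {"transient": {"cluster.routing.allocation.enable": mode}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder address; replace with a node of the cluster.
before_restart = allocation_request("http://localhost:9200", "none")
after_restart = allocation_request("http://localhost:9200", "all")

print(before_restart.get_method(), before_restart.full_url)
print(before_restart.data.decode("utf-8"))
# To actually send a request: urllib.request.urlopen(before_restart)
```

The requests are only built here, not sent, so the snippet can be run without a cluster.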

If you want, I can send you the full logfile for that; I cannot post the whole exception messages because of the character limit.

kley · May 10, 2019, 10:56am

One thing I don't understand is that after a node has left the cluster and started the stopping process, it still wants to write some data?

[2019-05-02T12:31:28,026][INFO ][o.e.m.j.JvmGcMonitorService] [192.168.1.5] [gc][576469] overhead, spent [318ms] collecting in the last [1.1s]
[2019-05-02T14:52:45,846][INFO ][o.e.c.s.ClusterSettings  ] [192.168.1.5] updating [cluster.routing.allocation.enable] from [all] to [none]
[2019-05-02T14:56:15,349][INFO ][o.e.n.Node               ] [192.168.1.5] stopping ...
[2019-05-02T14:56:15,419][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02][1]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02][1], node[XkMhhclsSVa8lEaMm4YbJA], [R], s[STARTED], a[id=HqNoHYh5SKy3_Z9ZvIW5lA]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.4][192.168.1.4:9300][indices:data/write/bulk[s][r]] disconnected
[2019-05-02T14:56:15,419][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02-4][0]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02-4][0], node[gfky9iVcR6-9wLm7o1K6uA], [R], s[STARTED], a[id=Sp0OFgvzT9yzwZlaKR2S-w]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.8][192.168.1.8:9300][indices:data/write/bulk[s][r]] disconnected
[2019-05-02T14:56:15,424][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02-4][0]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02-4][0], node[g78Dg5QeTqytaMNr5bgQ6Q], [R], s[STARTED], a[id=ldyqt-u0SvKpj2-5JlU25Q]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.6][192.168.1.6:9300][indices:data/write/bulk[s][r]] disconnected
[2019-05-02T14:56:15,425][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02][1]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02][1], node[q1iinDqXSIW7G-SzNuZdPQ], [R], s[STARTED], a[id=lQH8nqoNRfizzLc2fKTSkQ]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.10][192.168.1.10:9300][indices:data/write/bulk[s][r]] disconnected
[2019-05-02T14:56:15,427][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02][1]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02][1], node[q1iinDqXSIW7G-SzNuZdPQ], [R], s[STARTED], a[id=lQH8nqoNRfizzLc2fKTSkQ]
org.elasticsearch.transport.NodeDisconnectedException: [192.168.1.10][192.168.1.10:9300][indices:data/write/bulk[s][r]] disconnected
[2019-05-02T14:56:15,425][WARN ][o.e.a.b.TransportShardBulkAction] [192.168.1.5] [[visit-global-daily35-d2019.05.02-16][0]] failed to perform indices:data/write/bulk[s] on replica [visit-global-daily35-d2019.05.02-16][0], node[g78Dg5QeTqytaMNr5bgQ6Q], [R], s[STARTED], a[id=msVXSk_JSvy8VsnjuIP6Cg]
org.elasticsearch.transport.NodeNotConnectedException: [192.168.1.6][192.168.1.6:9300] Node not connected

Are those some leftover inserts from before?

