Problem upgrade from 6.1.2 to 6.5.4

klahnakoski · January 20, 2019, 5:19pm

I upgraded the data nodes (data=true, master=false), and stood up two new master nodes (data=false, master=true). My final step was to bounce the old master node, in the hope that a new master (on the new version) would be elected. One was elected, but it is raising many errors.

What do I do now?

(some stack trace lines removed because of the posting limit)

[2019-01-20T16:58:57,244][WARN ][o.e.g.G.InternalReplicaShardAllocator] [master1] [unittest20190113_000000][6]: failed to list shard for shard_store on node [HrTEqeTNRZW8OfSEJ3Y2DA]
org.elasticsearch.action.FailedNodeException: Failed node [HrTEqeTNRZW8OfSEJ3Y2DA]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.lambda$handleException$32(TcpTransport.java:1268) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1266) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1258) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1188) [elasticsearch-6.5.4.jar:6.5.4]
        at 
Caused by: org.elasticsearch.transport.RemoteTransportException: [spot_54.190.10.44][172.31.1.87:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[unittest20190113_000000][6]]
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/data2/nodes/0/indices/bbMLDFi5Qt2Z3anblhTX-Q/6/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@6789a88d)): files: [recovery.BCjzSmBfTdmezeG79jnG9Q._1ll.dii, recovery.BCjzSmBfTdmezeG79jnG9Q._1ll.dim, recovery.BCjzSmBfTdmezeG79jnG9Q.segments_68, write.lock]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
        at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
        at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:131) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:201) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.store.Store.access$200(Store.java:129) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:851) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:784) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.store.Store.getMetadata(Store.java:287) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1176) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:127) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:111) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.5.4.jar:6.5.4]
        ... 1 more

The other new master seems to be waiting for something:

[2019-01-20T17:08:12,465][INFO ][o.e.x.m.e.l.LocalExporter] [master2] waiting for elected master node [{master1}{68sPQmYrRdW60YXSGUeT2w}{1mXMJUiKQNSezNiFgOF9lQ}{172.31.1.13}{172.31.1.13:9300}{ml.machine_memory=2090577920, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true, zone=primary}] to setup local exporter [default_local] (does it have x-pack installed?)

The old master does not seem to be in good shape either:

[2019-01-20T16:55:05,335][INFO ][o.e.n.Node               ] [coordinator6] started
[2019-01-20T16:55:07,687][ERROR][i.n.u.ResourceLeakDetector] LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetection.level=advanced' or call ResourceLeakDetector.setLevel() See http://netty.io/wiki/reference-counted-objects.html for more information.
[2019-01-20T16:59:08,667][WARN ][o.e.t.TransportService   ] [coordinator6] Received response for a request that has timed out, sent [242977ms] ago, timed out [212976ms] ago, action [internal:discovery/zen/fd/master_ping], node [{master1}{68sPQmYrRdW60YXSGUeT2w}{1mXMJUiKQNSezNiFgOF9lQ}{172.31.1.13}{172.31.1.13:9300}{ml.machine_memory=2090577920, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true, zone=primary}], id [264]

klahnakoski · January 20, 2019, 5:27pm

Although the master nodes still appear unhealthy, judging by the logs, the cluster appears to be responding to queries, and it does not appear to have forgotten the placement of shards.

warkolm · January 21, 2019, 2:52am

2 masters is a bad idea, see Important Configuration Changes | Elasticsearch: The Definitive Guide [2.x] | Elastic.

This seems to be relevant.

What does _cat/shards look like?

klahnakoski · January 21, 2019, 4:02pm

warkolm, thank you

There are three potential masters: the two that were not elected were upgraded first. Then the third master was bounced to elect one of those two. Then I had the situation I posted. The third master was upgraded, and bounced one more time. All nodes are now upgraded, and there are three nodes that can be elected master. The current master is still throwing the errors shown.

The _cat/shards output has over 3000 entries; most are started. Here are the most interesting lines. None of them alarm me; with the constant churn of new nodes, I expect there to be some shards initializing or relocating.

...
coverage20190114_000000            7  p STARTED        2334929  10.6gb 172.31.1.139 backup3
coverage20190114_000000            7  r INITIALIZING                   172.31.1.107 spot_54.187.150.103
coverage20190114_000000            22 p STARTED        2336125    11gb 172.31.1.139 backup3
...
coverage20190118_000000            0  p STARTED        1073163   4.8gb 172.31.1.122 backup1
unittest20190113_000000            6  r RELOCATING     7647076    16gb 172.31.1.109 spot_34.216.59.15 -> 172.31.1.64 WEKAYwQSR7-DwoJr43QLdw spot_34.215.232.192
unittest20190113_000000            6  p STARTED        7646821    16gb 172.31.1.122 backup1
...
unittest20190113_000000            38 p STARTED        7643134  16.2gb 172.31.1.7   backup2
unittest20190113_000000            58 r INITIALIZING                   172.31.1.159 spot_34.209.65.100
unittest20190113_000000            58 p STARTED        7640797  16.1gb 172.31.1.122 backup1
...
unittest20190113_000000            42 p STARTED        7627857    16gb 172.31.1.139 backup3
unittest20190113_000000            42 r RELOCATING     7627560  15.9gb 172.31.1.110 spot_34.222.223.157 -> 172.31.1.228 LUeaCXORRUeuF71ofNo5dA spot_34.210.107.191
unittest20190113_000000            46 r STARTED        7626761  15.9gb 172.31.1.153 spot_18.237.52.93
...
unittest20190113_000000            47 p STARTED        7637874  15.9gb 172.31.1.122 backup1
unittest20190113_000000            50 r RELOCATING     7639822  16.2gb 172.31.1.49  spot_34.222.166.65 -> 172.31.1.10 PZ00_77TRKC8RgiT7mH6Vg spot_54.149.151.0
unittest20190113_000000            50 p STARTED        7639338  16.3gb 172.31.1.122 backup1
...
unittest20181216_000000            21 p STARTED        9160694    19gb 172.31.1.122 backup1
unittest20181216_000000            17 r INITIALIZING                   172.31.1.165 spot_54.187.118.85
unittest20181216_000000            17 p STARTED        9143600    19gb 172.31.1.122 backup1
...
perf20180101_000000                15 r STARTED        7928999  19.2gb 172.31.1.165 spot_54.187.118.85
perf20180101_000000                15 r INITIALIZING                   172.31.1.98  spot_54.187.66.47
perf20180101_000000                15 p STARTED        7928999  19.2gb 172.31.1.122 backup1
...
perf20180101_000000                21 r STARTED        7928581  19.2gb 172.31.1.99  spot_18.236.242.252
perf20180101_000000                21 r INITIALIZING                   172.31.1.82  spot_18.237.156.75
perf20180101_000000                21 p STARTED        7928581  19.2gb 172.31.1.122 backup1
...
unittest20181223_000000            20 p STARTED        3968224   8.2gb 172.31.1.139 backup3
unittest20181223_000000            20 r RELOCATING     3968079   8.2gb 172.31.1.109 spot_34.216.59.15 -> 172.31.1.158 M4fWniPcRliL3CMrKn4Q0Q spot_34.221.130.126
unittest20181223_000000            55 r STARTED        3965423   8.2gb 172.31.1.83  spot_54.190.8.98
...
unittest20181230_000000            9  p STARTED        6808074  14.1gb 172.31.1.139 backup3
unittest20181230_000000            9  r RELOCATING     6808004  14.1gb 172.31.1.146 spot_34.221.3.212 -> 172.31.1.153 TB4RqDy9S2yfPhyPrr19yg spot_18.237.52.93
unittest20181230_000000            2  r STARTED        6809242  14.1gb 172.31.1.153 spot_18.237.52.93
...
debug-etl20181122_091655           1  p STARTED      317934911  64.4gb 172.31.1.139 backup3
debug-etl20181122_091655           1  r INITIALIZING                   172.31.1.35  spot_52.11.139.93
debug-etl20181122_091655           0  r STARTED      318102169  64.5gb 172.31.1.71  spot_54.186.242.143
...
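
For anyone wanting to pull the "interesting" lines out of a large _cat/shards listing automatically, here is a minimal sketch. It assumes the default column layout shown above (index, shard, prirep, state, then the remaining columns) and the helper names are my own, not part of any Elasticsearch client:

```python
from collections import Counter

# A few lines copied from the listing above, including an elision marker.
SAMPLE = """\
coverage20190114_000000            7  p STARTED        2334929  10.6gb 172.31.1.139 backup3
coverage20190114_000000            7  r INITIALIZING                   172.31.1.107 spot_54.187.150.103
...
unittest20190113_000000            6  r RELOCATING     7647076    16gb 172.31.1.109 spot_34.216.59.15 -> 172.31.1.64 WEKAYwQSR7-DwoJr43QLdw spot_34.215.232.192
unittest20190113_000000            6  p STARTED        7646821    16gb 172.31.1.122 backup1
"""

def parse_cat_shards(text):
    """Parse default-column _cat/shards text into a list of dicts.

    Lines with fewer than six whitespace-separated fields (blank lines,
    '...' markers) are skipped. Everything after the state column is kept
    unparsed in 'rest', because INITIALIZING rows have no docs/store values.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue
        index, shard, prirep, state = parts[:4]
        rows.append({
            "index": index,
            "shard": int(shard),
            "prirep": prirep,
            "state": state,
            "rest": parts[4:],
        })
    return rows

def not_started(rows):
    """Return only the shard copies that are not in the STARTED state."""
    return [row for row in rows if row["state"] != "STARTED"]

if __name__ == "__main__":
    rows = parse_cat_shards(SAMPLE)
    print(Counter(row["state"] for row in rows))
    for row in not_started(rows):
        print(row["index"], row["shard"], row["prirep"], row["state"])
```

Counting by state gives a quick sense of how much recovery work is still in flight without reading 3000 lines by eye.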

klahnakoski · January 21, 2019, 4:13pm

I have confirmed the elected master is still complaining. I wonder whether the other masters will have the same problem if I bounce this one. Each master (master=true, data=false, ingest=false) runs alone on its own machine.

[2019-01-21T14:53:40,465][WARN ][o.e.g.G.InternalReplicaShardAllocator] [master1] [treeherder20181001_000000][9]: failed to list shard for shard_store on node [CsQhQNXzR1GjiRCXGUoknA]
org.elasticsearch.action.FailedNodeException: Failed node [CsQhQNXzR1GjiRCXGUoknA]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.5.4.jar:6.5.4]
	...
Caused by: org.elasticsearch.transport.RemoteTransportException: [spot_50.112.30.180][172.31.1.72:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[treeherder20181001_000000][9]]
	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:113) ~[elasticsearch-6.5.4.jar:6.5.4]
	...
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/data1/nodes/0/indices/ZA_PgZamQqaLLL1Rup0w2w/9/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@e5faa98)): files: [recovery.7dBceZxpQVOomkCsLuYtyg._3sr.dii, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.dim, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.fdt, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.fdx, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.fnm, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.si, recovery.7dBceZxpQVOomkCsLuYtyg._3sr_1.liv, recovery.7dBceZxpQVOomkCsLuYtyg._3sr_Lucene50_0.doc, recovery.7dBceZxpQVOomkCsLuYtyg._3sr_Lucene50_0.tim, ... recovery.7dBceZxpQVOomkCsLuYtyg._e34.si, recovery.7dBceZxpQVOomkCsLuYtyg._e35.cfe, recovery.7dBceZxpQVOomkCsLuYtyg._e35.cfs, recovery.7dBceZxpQVOomkCsLuYtyg._e35.si, recovery.7dBceZxpQVOomkCsLuYtyg._e36.cfe, recovery.7dBceZxpQVOomkCsLuYtyg._e36.cfs, recovery.7dBceZxpQVOomkCsLuYtyg._e36.si, recovery.7dBceZxpQVOomkCsLuYtyg.segments_cz8, write.lock]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:131) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:201) ~[elasticsearch-6.5.4.jar:6.5.4]

The other two potential masters are still waiting "to setup local exporter":

[2019-01-21T16:02:32,233][INFO ][o.e.x.m.e.l.LocalExporter] [master2] waiting for elected master node [{master1}{68sPQmYrRdW60YXSGUeT2w}{1mXMJUiKQNSezNiFgOF9lQ}{172.31.1.13}{172.31.1.13:9300}{ml.machine_memory=2090577920, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true, zone=primary}] to setup local exporter [default_local] (does it have x-pack installed?)

DavidTurner · January 21, 2019, 4:23pm

[2019-01-20T16:58:57,244][WARN ][o.e.g.G.InternalReplicaShardAllocator] [master1] [unittest20190113_000000][6]: failed to list shard for shard_store on node [HrTEqeTNRZW8OfSEJ3Y2DA]
org.elasticsearch.action.FailedNodeException: Failed node [HrTEqeTNRZW8OfSEJ3Y2DA]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.lambda$handleException$32(TcpTransport.java:1268) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1266) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1258) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1188) [elasticsearch-6.5.4.jar:6.5.4]
        at 
Caused by: org.elasticsearch.transport.RemoteTransportException: [spot_54.190.10.44][172.31.1.87:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[unittest20190113_000000][6]]
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/data2/nodes/0/indices/bbMLDFi5Qt2Z3anblhTX-Q/6/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@6789a88d)): files: [recovery.BCjzSmBfTdmezeG79jnG9Q._1ll.dii, recovery.BCjzSmBfTdmezeG79jnG9Q._1ll.dim, recovery.BCjzSmBfTdmezeG79jnG9Q.segments_68, write.lock]

This looks like the result of a failed recovery, and will eventually be cleaned up when all the shards are allocated and the cluster becomes green.

waiting for elected master node [...] to setup local exporter [default_local] (does it have x-pack installed?)

This is basically benign although it does indicate that the master might be misconfigured or might be struggling to commit cluster state updates.

I don't know that there's any problem to fix here. Once the cluster has become healthy again I think these messages should stop. If they don't, or if the cluster appears to stop making progress towards health, then we can dig into that.
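
One way to watch for "progress towards health" is to poll the cluster health API and look at the shard counters. The following is only a sketch: the base URL is an assumption (adjust for your cluster and any authentication), and the function names are made up for illustration. The fields read here (status, initializing_shards, relocating_shards, unassigned_shards) are ones the _cluster/health response contains.

```python
import json
from urllib.request import urlopen

def fetch_health(base_url="http://localhost:9200"):
    """Fetch GET /_cluster/health as a dict. base_url is an assumed address."""
    with urlopen(base_url + "/_cluster/health") as resp:
        return json.load(resp)

def describe_health(health):
    """Summarise a _cluster/health response in one line."""
    pending = (
        health["initializing_shards"]
        + health["relocating_shards"]
        + health["unassigned_shards"]
    )
    if health["status"] == "green" and pending == 0:
        return "green, nothing in flight"
    return (
        f"{health['status']}: {pending} shard copies still "
        "initializing, relocating or unassigned"
    )

if __name__ == "__main__":
    # Hypothetical response values, for illustration only.
    example = {
        "status": "yellow",
        "initializing_shards": 5,
        "relocating_shards": 2,
        "unassigned_shards": 1,
    }
    print(describe_health(example))
```

If the pending count keeps falling between polls, the cluster is making progress; if it stalls while the warnings continue, that is the point at which it is worth digging further.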

klahnakoski · January 21, 2019, 5:06pm

The cluster was green earlier this morning. It is green again.

warkolm · January 21, 2019, 9:15pm

How many indices, shards and nodes do you have?

klahnakoski · January 21, 2019, 9:33pm

Currently, there are 56 indices, 3243 shards, and 45 nodes (40 nodes with data=true, 3 nodes with master=true, 2 nodes with neither).

warkolm · January 21, 2019, 9:33pm

That's nearly 60 shards per index, why so many?
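
The arithmetic behind that figure, as a quick sketch using the numbers quoted above. It assumes the 3243 count includes replicas as well as primaries (which is how _cat/shards lists them), and the per-node figure is only an average; shards need not be spread evenly.

```python
indices = 56
shards = 3243      # assumed to count primaries and replicas together
data_nodes = 40

shards_per_index = shards / indices
shards_per_data_node = shards / data_nodes

print(f"about {shards_per_index:.0f} shard copies per index")
print(f"about {shards_per_data_node:.0f} shard copies per data node")
```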

klahnakoski · January 21, 2019, 9:34pm

The larger indexes are a terabyte

EDIT: For clarity, that is a terabyte before replicas are counted

DavidTurner · January 22, 2019, 10:01am

I am not sure if you mean that the problems are now resolved since the cluster is green again, or if you mean that there is still an ongoing problem despite the cluster being green again. Could you clarify?

klahnakoski · January 22, 2019, 7:52pm

The failures continued, but have since stopped. The last one was 7 hours ago.

Is this error complaining about some other node? Is the problem on the filesystem of the master?

[2019-01-21T14:53:40,465][WARN ][o.e.g.G.InternalReplicaShardAllocator] [master1] [treeherder20181001_000000][9]: failed to list shard for shard_store on node [CsQhQNXzR1GjiRCXGUoknA]
org.elasticsearch.action.FailedNodeException: Failed node [CsQhQNXzR1GjiRCXGUoknA]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.transport.TcpTransport.lambda$handleException$32(TcpTransport.java:1268) ~[elasticsearch-6.5.4.jar:6.5.4]
	...   
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.4.jar:6.5.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_201]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_201]
	... 1 more
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/data1/nodes/0/indices/ZA_PgZamQqaLLL1Rup0w2w/9/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@e5faa98)): files: [recovery.7dBceZxpQVOomkCsLuYtyg._3sr.dii, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.dim, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.fdt, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.fdx, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.fnm, recovery.7dBceZxpQVOomkCsLuYtyg._3sr.si, recovery.7dBceZxpQVOomkCsLuYtyg._3sr_1.liv, 
...
write.lock]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:131) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:201) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.index.store.Store.access$200(Store.java:129) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:851) ~[elasticsearch-6.5.4.jar:6.5.4]
	...

DavidTurner · January 23, 2019, 10:24am

This is a message from the InternalReplicaShardAllocator indicating that it's trying to allocate a replica for shard [9] of index [treeherder20181001_000000], and is searching for existing copies of that shard. This means that not all replicas of this shard are allocated, and therefore at the time the cluster health was not green.
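
To make the reading of that log line concrete, here is a small sketch that pulls out which node logged the warning, which shard it concerns, and which node ID the failure came from. The regular expression is written against the line format seen in this thread and is not an official log grammar; the function name is hypothetical.

```python
import re

# Matches "[reporting node] [index][shard]: failed to list shard ... on node [id]"
PATTERN = re.compile(
    r"\[(?P<reporting_node>[^\]]+)\] "
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]: "
    r"failed to list shard for shard_store on node \[(?P<node_id>[^\]]+)\]"
)

def parse_allocator_warning(line):
    """Return a dict describing the warning, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["shard"] = int(info["shard"])
    return info

if __name__ == "__main__":
    line = (
        "[2019-01-21T14:53:40,465][WARN ][o.e.g.G.InternalReplicaShardAllocator] "
        "[master1] [treeherder20181001_000000][9]: failed to list shard for "
        "shard_store on node [CsQhQNXzR1GjiRCXGUoknA]"
    )
    print(parse_allocator_warning(line))
```

The reporting node (master1) is only relaying the failure; the node identified by node_id is the one that threw the exception. To turn the ID into a node name, a request along the lines of GET _cat/nodes?full_id=true&h=id,name should list both columns.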

The specific exception is that it asked the node with ID CsQhQNXzR1GjiRCXGUoknA to look for a copy of this shard and got an exception back, because that node has a directory for this shard but that directory does not contain a complete shard copy (specifically, it has no file matching segments*). However, it has files with names that begin recovery...., which indicates that Elasticsearch was at an earlier time trying to build a copy of this shard on this node, and that this process failed before completion. Elasticsearch is quite averse to deleting data when unhealthy, so it leaves these files in place and logs warnings about them instead.
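
The distinction can be illustrated with a short sketch that reads the "files: [...]" list out of the exception message and checks for the two patterns described above. This is a simplification for reading logs, not what Elasticsearch or Lucene does internally, and the function names and labels are my own.

```python
import re

def files_in_message(message):
    """Extract the file names from the 'files: [...]' part of the exception text."""
    match = re.search(r"files: \[(.*?)\]", message, re.S)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

def classify_store(files):
    """Roughly classify a shard directory from its file names."""
    if any(name.startswith("segments") for name in files):
        return "has a commit point"
    if any(name.startswith("recovery.") for name in files):
        return "leftovers from an unfinished recovery"
    return "no commit point and no recovery files"

if __name__ == "__main__":
    message = (
        "no segments* file found in store(...): files: "
        "[recovery.BCjzSmBfTdmezeG79jnG9Q._1ll.dii, "
        "recovery.BCjzSmBfTdmezeG79jnG9Q._1ll.dim, "
        "recovery.BCjzSmBfTdmezeG79jnG9Q.segments_68, write.lock]"
    )
    files = files_in_message(message)
    print(files)
    print(classify_store(files))
```

Note that recovery.BCjzSmBfTdmezeG79jnG9Q.segments_68 does not count as a segments* file: the recovery. prefix marks it as a temporary file that was never renamed into place.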

klahnakoski · January 23, 2019, 7:07pm

Excellent. If this happens again, I can terminate the problem node (as long as it does not hold my last copy of the shard).

system · February 20, 2019, 7:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
