Elasticsearch nodes are leaving the cluster continuously

Hi,

The nodes in the cluster are continuously leaving and rejoining after a few seconds. When I checked the logs for errors, I see a "master not discovered yet: have discovered" error. I don't have any clue what is happening. Can anyone please help me solve this issue?

Sharing your logs would be helpful.

Please also provide some information about the cluster and how it is deployed.

Hi Warkolm and Christian,

[2020-07-28T12:29:31,839][INFO ][o.e.c.c.JoinHelper       ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=82605, lastAcceptedTerm=82569, lastAcceptedVersion=1717390, sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, targetNode={prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 82611 while handling publication
        at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.1.jar:7.2.1]
[2020-07-28T12:29:32,469][INFO ][o.e.m.j.JvmGcMonitorService] [prod-poc-node16] [gc][4214] overhead, spent [256ms] collecting in the last [1s]
[2020-07-28T12:29:37,529][INFO ][o.e.m.j.JvmGcMonitorService] [prod-poc-node16] [gc][4219] overhead, spent [286ms] collecting in the last [1s]
[2020-07-28T12:29:41,690][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [prod-poc-node16] no known master node, scheduling a retry
[2020-07-28T12:29:41,815][INFO ][o.e.c.s.ClusterApplierService] [prod-poc-node16] master node changed {previous [], current [{prod-poc-node14}{d0GsDeHSSqqI8ZULw61nJA}{07hcxeFQRxmi4bq_4eAGBQ}{*.*.*.*}{*.*.*.*:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}]}, term: 82625, version: 1717391, reason: ApplyCommitRequest{term=82625, version=1717391, sourceNode={prod-poc-node14}{d0GsDeHSSqqI8ZULw61nJA}{07hcxeFQRxmi4bq_4eAGBQ}{*.*.*.*}{*.*.*.*:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}
[2020-07-28T12:29:43,341][WARN ][o.e.t.TcpTransport       ] [prod-poc-node16] exception caught on transport layer [Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.132.254.49:51274}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:472) ~[netty-codec-4.1.35.Final.jar:4.1.35.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.35.Final.jar:4.1.35.Final]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
        at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
Caused by: javax.crypto.BadPaddingException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
        at sun.security.ssl.SSLCipher$T13GcmReadCipherGenerator$GcmReadCipher.decrypt(SSLCipher.java:1878) ~[?:?]
        at sun.security.ssl.SSLEngineInputRecord.decodeInputRecord(SSLEngineInputRecord.java:240) ~[?:?]
[2020-07-28T12:29:57,147][INFO ][o.e.c.s.ClusterApplierService] [prod-poc-node16] removed {{prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true},}, term: 82625, version: 1717392, reason: ApplyCommitRequest{term=82625, version=1717392, sourceNode={prod-poc-node14}{d0GsDeHSSqqI8ZULw61nJA}{07hcxeFQRxmi4bq_4eAGBQ}{*.*.*.*}{*.*.*.*:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}
[2020-07-28T12:29:57,426][INFO ][o.e.c.c.JoinHelper       ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional.empty}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
[2020-07-28T12:29:57,427][INFO ][o.e.c.c.JoinHelper       ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional.empty}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
[2020-07-28T12:29:57,428][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [prod-poc-node16] connection exception while trying to forward request with action name [indices:admin/create] to master node [{prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][indices:admin/create] disconnected]
[2020-07-28T12:29:57,428][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [prod-poc-node16] timed out while retrying [indices:admin/create] after failure (timeout [1m])
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][indices:admin/create] disconnected
[2020-07-28T12:29:57,429][INFO ][o.e.c.c.JoinHelper       ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=82611, lastAcceptedTerm=82569, lastAcceptedVersion=1717390, sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, targetNode={prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
[2020-07-28T12:29:57,429][INFO ][o.e.c.c.JoinHelper       ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=82611, lastAcceptedTerm=82569, lastAcceptedVersion=1717390, sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, targetNode={prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-poc-node14][*.*.*.*:9300][indices:admin/create]
Caused by: java.lang.IllegalArgumentException: Validation Failed: 1: this action would add [1] total shards, but this cluster currently has [22134]/[13000] maximum shards open;
        at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.checkShardLimit(MetaDataCreateIndexService.java:657) ~[elasticsearch-7.2.1.jar:7.2.1]
[2020-07-28T12:30:15,846][WARN ][o.e.x.m.MonitoringService] [prod-poc-node16] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:121) ~[?:?]
        ... 58 more
[2020-07-28T12:30:31,127][WARN ][o.e.x.m.MonitoringService] [prod-poc-node16] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
        

This is the log from one of my data nodes (some of the lines are truncated).
The cluster details are: dedicated master nodes: 3 | dedicated coordinating node: 1 | data nodes: 13 | RAM on each data node: 48 GB (24 GB heap allocated) | CPUs per data node: 16
Total shard count: 22,142
Total data size in the cluster: 32 TB
I have stopped all queries and ingestion, but the nodes are still leaving the cluster with the "master not discovered yet: have discovered" error.
Please advise what the issue could be. Also let me know if further information is required.

You have way too many shards. Take a look at https://www.elastic.co/guide/en/elasticsearch/reference/7.8/avoid-oversharding.html
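To see where you stand, the cluster health, cluster settings, and cat shards APIs are a quick check. Here's a rough sketch; the URL and credentials are placeholders for your setup:

```python
# Sketch: check total shard count against the cluster-wide shard limit.
# The endpoint URL and credentials are placeholders; adjust for your cluster.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")  # placeholder credentials

# Overall health: look at status, active shards, and unassigned shards.
health = requests.get(f"{ES}/_cluster/health", auth=AUTH).json()
print(health["status"], health["active_shards"], health["unassigned_shards"])

# The configured per-node shard limit (default is 1000 per data node).
settings = requests.get(
    f"{ES}/_cluster/settings?include_defaults=true&flat_settings=true", auth=AUTH
).json()
for section in ("persistent", "transient", "defaults"):
    limit = settings.get(section, {}).get("cluster.max_shards_per_node")
    if limit:
        print(section, "cluster.max_shards_per_node =", limit)

# Shard count per index, largest first, to see which indices to target.
shards = requests.get(f"{ES}/_cat/shards?format=json", auth=AUTH).json()
counts = {}
for s in shards:
    counts[s["index"]] = counts.get(s["index"], 0) + 1
for index, n in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
    print(index, n)
```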

Thanks for the reply. My concern is: how can I fix the shard issue for the existing indices? The nodes are continuously leaving the cluster, the cluster status is still red, and I can't access any data in the cluster.

Are the masters up and stable, i.e. has one been elected, and can they provide status, health, and feedback quickly? It seems maybe not, as messages like "node is no longer master for term" imply the masters are changing (maybe I'm not reading that right).

We used to have too many shards for our RAM, and it was very obvious because the masters were not stable, even at 16 GB heap. If yours are not stable, maybe try to give the masters more heap; you can go to 32 GB first or even more, like 40 GB (yes, slower, but you've got to get this stable enough to fix things).

If the masters are stable, then it seems odd the nodes can't 'find' them. If you were querying and ingesting, this implies it used to work, so what changed? Did it degrade over time or suddenly die?

If the cluster is stable, you can at least get rid of replicas on indexes that have too many, i.e. set the replica count to 1 and hope it purges a lot of them. If you already have only 1 replica, you can go to 0 if the data is not valuable, or not as important as fixing things. After that you'd have to shrink indexes to fewer shards if you can (there is an API for this); a sketch of the replica part is below.
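To be concrete, this is roughly what I mean by dropping replicas, using the index settings API. The URL, credentials, and index pattern are placeholders, and think hard before going to 0 replicas on data you care about:

```python
# Sketch: drop replicas on a set of indices to free shards quickly.
# WARNING: number_of_replicas = 0 means a single lost node can lose data.
# The URL, credentials, and index pattern below are placeholders.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")  # placeholder credentials
PATTERN = "logs-*"              # example index pattern, adjust to yours

resp = requests.put(
    f"{ES}/{PATTERN}/_settings",
    json={"index": {"number_of_replicas": 0}},  # or 1 if you still want one copy
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true} on success
```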

Then work toward stability from there, but agreed, you need the data nodes. The cluster being red is not as important as getting sets of indexes to yellow so you can work on them without losing data. Then hope you can get enough nodes connected again to bring more indexes to yellow, and so on until everything is green.

(I also wonder if you lost nodes and the cluster allocated more shards but never purged the old ones due to the instability (never going green), so they keep piling up because it can never keep the nodes stable long enough.)

Steve,

Thanks for the insights. We have replicas set to 1, and I am nervous about setting it to 0 as this is a production system. Since we store 30 days of data per customer, with one index per day per log type, I am thinking of closing the oldest 10 days' worth of indices (changing the index state to closed). I am hoping this will release some shards and make the cluster stable; once it is stable, I will open one index at a time and shrink it.
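For reference, this is roughly how I'm planning to script the close step. The index naming pattern, cutoff, URL, and credentials below are just examples for our daily indices:

```python
# Sketch: close daily indices older than a cutoff to release their shards.
# Index naming (…-YYYY.MM.DD suffix), URL, and credentials are assumptions;
# adjust for your own naming scheme.
import requests
from datetime import datetime, timedelta

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")  # placeholder credentials
CUTOFF = datetime.utcnow() - timedelta(days=10)

indices = requests.get(f"{ES}/_cat/indices?format=json&h=index,status", auth=AUTH).json()
for idx in indices:
    name = idx["index"]
    try:
        # Assumes the date is the last dash-separated field, e.g. foo-2020.07.01
        day = datetime.strptime(name.rsplit("-", 1)[-1], "%Y.%m.%d")
    except ValueError:
        continue  # skip indices that don't follow the daily naming pattern
    if day < CUTOFF and idx["status"] == "open":
        r = requests.post(f"{ES}/{name}/_close", auth=AUTH)
        print("closed" if r.ok else "failed", name)
```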

Will this strategy work without me losing any data?

Thanks in advance.

Oh, yeah, forgot about that. We did exactly that when we had thousands of daily/type-driven indexes: just close anything older than 10 days or so to get them out of RAM, then stabilize the cluster from there. Agreed that going below 1 replica is not great.

Thanks Steve,

We are in the process of closing the indices; once the cluster is stable, I will get back to you.

The shard count is related to your instability. Closing indices is a start, but you will need to do a lot more work.


Thank you guys for your support,

We are in the process of reindexing to reduce the shard count. Is there any better way to reduce the shard count? Initially, we thought of closing indices to free up some resources and reduce the shard count; that worked on a previous-version cluster (6.2.1), but it didn't work on 7.2.1, and we ran into issues like "metadata missing for index", which we worked around by deleting the manifest file on the data node. We have now deleted some indices of around 350 GB with 3 shards (each shard holding more than 120 GB), and after monitoring for a few hours the cluster came up green.
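For context, the reindex we are running looks roughly like this; the index names, shard count, URL, and credentials are examples only:

```python
# Sketch: create a new index with fewer primary shards and reindex into it.
# Index names, shard count, URL, and credentials are examples only.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")        # placeholder credentials
SRC = "customer-logs-2020.07.01"      # example source index
DST = "customer-logs-2020.07.01-1s"   # example destination index

# Destination gets a single primary shard (and no replicas until we're done).
requests.put(
    f"{ES}/{DST}",
    json={"settings": {"index": {"number_of_shards": 1, "number_of_replicas": 0}}},
    auth=AUTH,
).raise_for_status()

# Run the reindex asynchronously and poll the task, since these are big indices.
task = requests.post(
    f"{ES}/_reindex?wait_for_completion=false",
    json={"source": {"index": SRC}, "dest": {"index": DST}},
    auth=AUTH,
).json()
print("reindex task:", task["task"])  # check later with GET /_tasks/<task-id>
```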

Can you advise which is the best way to reduce the shard count? Thank you.

That is not a good idea. Never delete files from the filesystem underneath Elasticsearch.

The best long-term plan is to use ILM.
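As a rough example (the policy name, template name, index pattern, URL, and credentials are placeholders), an ILM policy that simply deletes indices 30 days after creation could look something like this:

```python
# Sketch: a minimal ILM policy that deletes indices 30 days after creation.
# Policy name, template name, index pattern, URL, and credentials are placeholders.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")  # placeholder credentials

policy = {
    "policy": {
        "phases": {
            "delete": {"min_age": "30d", "actions": {"delete": {}}}
        }
    }
}
requests.put(f"{ES}/_ilm/policy/delete-after-30d", json=policy, auth=AUTH).raise_for_status()

# Attach the policy to new daily indices via a (legacy, 7.x-era) index template.
template = {
    "index_patterns": ["customer-logs-*"],  # example pattern, adjust to yours
    "settings": {"index.lifecycle.name": "delete-after-30d"},
}
requests.put(f"{ES}/_template/customer-logs", json=template, auth=AUTH).raise_for_status()
```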

Sure, I will use ILM to delete indices in Elasticsearch. To get the cluster stable, I had manually deleted some of the indices that were causing issues.

I have one more question,

In our other cluster, the master nodes also serve as data nodes (3 master-eligible nodes), and I am planning to add 3 separate dedicated master nodes. How can I switch over the master nodes without affecting the cluster? It is an 11-node cluster with 3 masters and 11 data nodes.

Also, what is the best way to reduce the shard count of existing indices?

Thanks in advance.

There is a shrink API in later versions to reduce shard count, BUT a copy of every shard needs to be on the same node first, so if they are large this is annoying because you have to move them; it's otherwise nice. The index also has to be green (easiest by removing replicas).

Otherwise, I assume you have to reindex to shrink if you can't get the shards onto the same node. That doc page has details on the shrink part; a rough sketch of the steps is below.
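Roughly, the shrink steps look like this; the index names, node name, target shard count, URL, and credentials are examples, so check the shrink docs for your exact version:

```python
# Sketch: shrink an index to fewer primary shards.
# Prereqs: a copy of every shard on one node, no replicas, writes blocked.
# Index names, node name, URL, and credentials are placeholders.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")   # placeholder credentials
SRC = "customer-logs-2020.07.01" # example source index
DST = f"{SRC}-shrunk"            # example target index
NODE = "prod-poc-node16"         # any data node with enough disk space

# 1. Move a copy of every shard to one node, drop replicas, block writes.
requests.put(
    f"{ES}/{SRC}/_settings",
    json={"settings": {
        "index.routing.allocation.require._name": NODE,
        "index.number_of_replicas": 0,
        "index.blocks.write": True,
    }},
    auth=AUTH,
).raise_for_status()

# 2. Wait until the index is green (all shards relocated), then shrink.
requests.get(f"{ES}/_cluster/health/{SRC}?wait_for_status=green&timeout=30m", auth=AUTH)

requests.post(
    f"{ES}/{SRC}/_shrink/{DST}",
    json={"settings": {
        "index.number_of_shards": 1,  # must be a factor of the source shard count
        "index.routing.allocation.require._name": None,  # clear the node pin on the target
    }},
    auth=AUTH,
).raise_for_status()
```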

Thanks for the answer.

What about switching the master nodes?

I had a long, detailed master migration procedure written up, but given how important it is, I would prefer you get official advice; it's better to have an Elasticsearch person's blessing on the process (I'm just an end user contributing to the discussion).

The general process is to add the new masters and remove the old ones, one at a time, allowing the cluster to stabilize and go green in between. But there are a few other things to do, so it's good to get better advice first; I'll keep my draft in case no one helps, and maybe turn it into a blog :wink:

This process will depend on the version used. The logic has changed in Elasticsearch 7.x compared to earlier versions.

He's on 7.2.1, but isn't the basic logic still the following, going slowly and always avoiding the loss of 50% of the existing master-eligible nodes:

  • Add the three new nodes as master-eligible (master-eligible nodes 3 -> 6)
  • Restart two of the existing master-eligible (but not currently elected) nodes with node.master: false (master-eligible nodes 6 -> 4)
  • Restart the existing elected master with node.master: false (master-eligible nodes 4 -> 3)

For every step, go slowly, one node at a time, waiting for the cluster to settle back to green between each node start/stop/restart; a rough sketch of the 7.x-specific piece is below.
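One 7.x-specific wrinkle I believe applies here: before restarting an old master-eligible node as data-only, take it out of the voting configuration first with the voting config exclusions API, roughly like this (the node name, URL, and credentials are placeholders; please double-check the exact API against the 7.2 docs):

```python
# Sketch (7.x): exclude an old master-eligible node from the voting configuration
# before restarting it with node.master: false (set in its elasticsearch.yml).
# Node name, URL, and credentials are placeholders; verify the API for your version.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")    # placeholder credentials
OLD_MASTER = "old-master-node-1"  # hypothetical node name

# Ask the cluster to reconfigure voting without this node (7.2-era path-style call).
requests.post(
    f"{ES}/_cluster/voting_config_exclusions/{OLD_MASTER}", auth=AUTH
).raise_for_status()

# ...now restart that node with node.master: false in its elasticsearch.yml and
# wait for the cluster to go green, then clear the exclusion list. The node stays
# in the cluster as data-only, so don't wait for it to leave before clearing.
requests.delete(
    f"{ES}/_cluster/voting_config_exclusions?wait_for_removal=false", auth=AUTH
).raise_for_status()
```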

Thank you