Hi Warkolm and Christian,
[2020-07-28T12:29:31,839][INFO ][o.e.c.c.JoinHelper ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=82605, lastAcceptedTerm=82569, lastAcceptedVersion=1717390, sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, targetNode={prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 82611 while handling publication
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.1.jar:7.2.1]
[2020-07-28T12:29:32,469][INFO ][o.e.m.j.JvmGcMonitorService] [prod-poc-node16] [gc][4214] overhead, spent [256ms] collecting in the last [1s]
[2020-07-28T12:29:37,529][INFO ][o.e.m.j.JvmGcMonitorService] [prod-poc-node16] [gc][4219] overhead, spent [286ms] collecting in the last [1s]
[2020-07-28T12:29:41,690][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [prod-poc-node16] no known master node, scheduling a retry
[2020-07-28T12:29:41,815][INFO ][o.e.c.s.ClusterApplierService] [prod-poc-node16] master node changed {previous [], current [{prod-poc-node14}{d0GsDeHSSqqI8ZULw61nJA}{07hcxeFQRxmi4bq_4eAGBQ}{*.*.*.*}{*.*.*.*:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}]}, term: 82625, version: 1717391, reason: ApplyCommitRequest{term=82625, version=1717391, sourceNode={prod-poc-node14}{d0GsDeHSSqqI8ZULw61nJA}{07hcxeFQRxmi4bq_4eAGBQ}{*.*.*.*}{*.*.*.*:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}
[2020-07-28T12:29:43,341][WARN ][o.e.t.TcpTransport ] [prod-poc-node16] exception caught on transport layer [Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.132.254.49:51274}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:472) ~[netty-codec-4.1.35.Final.jar:4.1.35.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.35.Final.jar:4.1.35.Final]
at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
Caused by: javax.crypto.BadPaddingException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at sun.security.ssl.SSLCipher$T13GcmReadCipherGenerator$GcmReadCipher.decrypt(SSLCipher.java:1878) ~[?:?]
at sun.security.ssl.SSLEngineInputRecord.decodeInputRecord(SSLEngineInputRecord.java:240) ~[?:?]
[2020-07-28T12:29:57,147][INFO ][o.e.c.s.ClusterApplierService] [prod-poc-node16] removed {{prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true},}, term: 82625, version: 1717392, reason: ApplyCommitRequest{term=82625, version=1717392, sourceNode={prod-poc-node14}{d0GsDeHSSqqI8ZULw61nJA}{07hcxeFQRxmi4bq_4eAGBQ}{*.*.*.*}{*.*.*.*:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}
[2020-07-28T12:29:57,426][INFO ][o.e.c.c.JoinHelper ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional.empty}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
[2020-07-28T12:29:57,427][INFO ][o.e.c.c.JoinHelper ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional.empty}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
[2020-07-28T12:29:57,428][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [prod-poc-node16] connection exception while trying to forward request with action name [indices:admin/create] to master node [{prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][indices:admin/create] disconnected]
[2020-07-28T12:29:57,428][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [prod-poc-node16] timed out while retrying [indices:admin/create] after failure (timeout [1m])
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][indices:admin/create] disconnected
[2020-07-28T12:29:57,429][INFO ][o.e.c.c.JoinHelper ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=82611, lastAcceptedTerm=82569, lastAcceptedVersion=1717390, sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, targetNode={prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
[2020-07-28T12:29:57,429][INFO ][o.e.c.c.JoinHelper ] [prod-poc-node16] failed to join {prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=82611, lastAcceptedTerm=82569, lastAcceptedVersion=1717390, sourceNode={prod-poc-node16}{LUH1uS6VToiCu7JJ_O9WYg}{O_FG4fUCS2emSu-zXsuKmQ}{10.132.29.67}{10.132.29.67:9300}{ml.machine_memory=50476199936, xpack.installed=true, ml.max_open_jobs=20}, targetNode={prod-poc-node13}{1H8SdgrDTnWA_C_POxS5WA}{joIVxOPnRWmEpUXFx1CMDg}{10.132.254.49}{10.132.254.49:9300}{ml.machine_memory=8200970240, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.NodeDisconnectedException: [prod-poc-node13][10.132.254.49:9300][internal:cluster/coordination/join] disconnected
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-poc-node14][*.*.*.*:9300][indices:admin/create]
Caused by: java.lang.IllegalArgumentException: Validation Failed: 1: this action would add [1] total shards, but this cluster currently has [22134]/[13000] maximum shards open;
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.checkShardLimit(MetaDataCreateIndexService.java:657) ~[elasticsearch-7.2.1.jar:7.2.1]
[2020-07-28T12:30:15,846][WARN ][o.e.x.m.MonitoringService] [prod-poc-node16] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:121) ~[?:?]
... 58 more
[2020-07-28T12:30:31,127][WARN ][o.e.x.m.MonitoringService] [prod-poc-node16] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
This is the log from one of my data nodes (some of the lines are truncated).
The cluster details are: dedicated master nodes: 3 | dedicated coordinating node: 1 | data nodes: 13 | RAM on each data node: 48 GB (24 GB allocated to heap) | CPUs per data node: 16
Total shard count: 22,142
Total data size in the cluster: 32TB
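For context on the shard-limit error in the log above ("[22134]/[13000] maximum shards open"): with 13 data nodes, the 13000 figure looks like the default cluster.max_shards_per_node of 1000 per data node. Here is how the current shard count and that limit can be checked (standard _cluster APIs; localhost:9200 is just a placeholder for one of the nodes):

```
# Cluster status and current active shard totals
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Effective per-node shard limit, including the default value
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node&pretty'
```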
I have stopped all queries and ingestion, but the nodes are still leaving the cluster with a "master not discovered yet: have discovered" error.
Please advise on what the issue could be, and let me know if any further information is required.
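In case it helps with diagnosis, these are the checks I can run while the nodes are dropping out (standard _cat APIs; the address is again a placeholder):

```
# Which node the cluster currently sees as the elected master
curl -s 'http://localhost:9200/_cat/master?v'

# Nodes currently in the cluster, with heap, load, and roles
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,node.role,master'
```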