Problem with master-nodes after resizing underlying instances (EC2 AWS)

Hi,

we're running a Cluster with 3 dedicated Master on AWS EC2 Instances and changed the instance type. Every node got a clean and graceful shutdown and everything worked fine until we changed the elected master node.

=> the cluster was not accessible for a few minutes and at this time it seemed to be failing - the data-nodes were accessible the whole time but the master-nodes weren't.

After some loops the master-nodes regenerated themselves, but it wasn't possible for us to find the reason for the failure. We came the conclusion that the time interval between the changes was too narrow but we are not sure.

Here is an excerpt of the logs:

[2020-05-11T16:37:49,421][WARN ][o.e.t.TcpTransport       ] [es-master-41-XXX] exception caught on transport layer [Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/X.X.X.X:39326}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:472) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
	at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:320) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:263) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:258) ~[?:?]
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:129) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:681) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:636) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:454) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:433) ~[?:?]
	at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634) ~[?:?]
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:295) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1332) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	... 16 more
[2020-05-11T16:38:17,530][ERROR][o.e.x.s.a.e.NativeUsersStore] [es-master-41-XXX] security index is unavailable. short circuiting retrieval of user [...]

Do you got any suggestion why?

Best regards

The tiny log excerpt you shared does not look relevant, at least not in isolation. Please share all the logs from all the master nodes for the duration of the outage.

1 Like

Hi @DavidTurner,

great thanks to you for your support. In the following files you'll find the the complete logs excerpts from the 3 master nodes we use.

LINK

Thank you for your support
Martin

Hi @DavidTurner,

we would be happy to hear from you as further updates are pending.

Thanks a lot in advance
Martin

Ah yes, I missed your earlier link.

When you shut node 145 down, the other two nodes were not able to commit a cluster state quickly enough:

[2020-05-11T16:30:26,922][INFO ][o.e.c.c.JoinHelper       ] [es-master-41-73] failed to join {es-master-41-148}{uRmHVkNlS7i_UdLqpJaOPA}{VQCx5Qg7QDCQ_eoOdw8Exg}{10.10.41.148}{10.10.41.148:9300}{m}{ml.machine_memory=16521273344, ml.
org.elasticsearch.transport.RemoteTransportException: [es-master-41-148][10.10.41.148:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1378) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1302) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:88) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1269) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.3.2.jar:7.3.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 30s
        at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:85) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1269) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.3.2.jar:7.3.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]

There's no information in these logs indicating why it took over 30 seconds to process the next cluster state, but the usual reason is poor IO performance. There's been improvements to this area in more recent versions (better logging at least, but also better handling of slow disks) so I suggest you upgrade before trying anything else.