Problem with master-nodes after resizing underlying instances (EC2 AWS)

Mathgox · May 12, 2020, 8:31am

Hi,

we're running a Cluster with 3 dedicated Master on AWS EC2 Instances and changed the instance type. Every node got a clean and graceful shutdown and everything worked fine until we changed the elected master node.

=> the cluster was not accessible for a few minutes and at this time it seemed to be failing - the data-nodes were accessible the whole time but the master-nodes weren't.

After some loops the master-nodes regenerated themselves, but it wasn't possible for us to find the reason for the failure. We came the conclusion that the time interval between the changes was too narrow but we are not sure.

Here is an excerpt of the logs:

[2020-05-11T16:37:49,421][WARN ][o.e.t.TcpTransport       ] [es-master-41-XXX] exception caught on transport layer [Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/X.X.X.X:39326}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:472) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
	at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:320) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:263) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:258) ~[?:?]
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:129) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:681) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:636) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:454) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:433) ~[?:?]
	at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634) ~[?:?]
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:295) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1332) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) ~[netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) ~[netty-codec-4.1.36.Final.jar:4.1.36.Final]
	... 16 more

[2020-05-11T16:38:17,530][ERROR][o.e.x.s.a.e.NativeUsersStore] [es-master-41-XXX] security index is unavailable. short circuiting retrieval of user [...]

Do you got any suggestion why?

Best regards

DavidTurner · May 12, 2020, 10:10am

The tiny log excerpt you shared does not look relevant, at least not in isolation. Please share all the logs from all the master nodes for the duration of the outage.

Mathgox · May 14, 2020, 9:36am

Hi @DavidTurner,

great thanks to you for your support. In the following files you'll find the the complete logs excerpts from the 3 master nodes we use.

LINK

Thank you for your support
Martin

Mathgox · June 3, 2020, 8:52am

Hi @DavidTurner,

we would be happy to hear from you as further updates are pending.

Thanks a lot in advance
Martin

DavidTurner · June 3, 2020, 9:57am

Ah yes, I missed your earlier link.

When you shut node 145 down, the other two nodes were not able to commit a cluster state quickly enough:

[2020-05-11T16:30:26,922][INFO ][o.e.c.c.JoinHelper       ] [es-master-41-73] failed to join {es-master-41-148}{uRmHVkNlS7i_UdLqpJaOPA}{VQCx5Qg7QDCQ_eoOdw8Exg}{10.10.41.148}{10.10.41.148:9300}{m}{ml.machine_memory=16521273344, ml.
org.elasticsearch.transport.RemoteTransportException: [es-master-41-148][10.10.41.148:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1378) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1302) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:88) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1269) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.3.2.jar:7.3.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 30s
        at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:85) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1269) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.3.2.jar:7.3.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]

There's no information in these logs indicating why it took over 30 seconds to process the next cluster state, but the usual reason is poor IO performance. There's been improvements to this area in more recent versions (better logging at least, but also better handling of slow disks) so I suggest you upgrade before trying anything else.

Mathgox · June 4, 2020, 6:04am

Good morging,

thanks a lot for your support and your suggest. We'll update the cluster asap.

Kind regards
Martin

system · July 2, 2020, 6:05am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Very Strange Master Node Issue - Closed nodes not being removed Elasticsearch	5	382	July 6, 2017
Loss of Connection between Nodes Elasticsearch	5	745	July 6, 2017
Master node failure causes cluster to fail Elasticsearch	3	1680	July 6, 2017
ElasticSearch on EC2 - runs into problem recovering when one of the nodes times out then recovers Elasticsearch	2	360	July 6, 2017
Master_left and no other node elected to become master Elasticsearch	5	1100	July 6, 2017

Problem with master-nodes after resizing underlying instances (EC2 AWS)

Related topics