Cluster unstable. Communication problem within cluster

#1

Hi,

I have a 5 node elasticsearch cluster (7.0.0). Nodes randomly leave the cluster and join back. This is happening with most of the nodes.

The logs on the nodes look something like this:

failed to connect to node {master-node}{eRSnrCA-QQ6v0CTfQzwSGw}{uCbIdgAPQqiIkke7eBu4pg}{master-node}{x.x.x.x:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [master-node][x.x.x.x:9300] connect_exception
        at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1299) ~[elasticsearch-7.0.0.jar:7.0.0]
        at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:99) ~[elasticsearch-7.0.0.jar:7.0.0]
        at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.0.0.jar:7.0.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2159) ~[?:?]
        at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.0.0.jar:7.0.0]
        at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$new$1(Netty4TcpChannel.java:72) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: master-node/x.x.x.x:9300
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        ... 6 more
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        ... 6 more
master node changed {previous [], current [{master-node}{eRSnrCA-QQ6v0CTfQzwSGw}{uCbIdgAPQqiIkke7eBu4pg}{master-node}{x.x.x.x:9300}]}, term: 7, version: 49163, reason: ApplyCommitRequest{term=7, version=49163, sourceNode={master-node}
#2

I restarted all nodes on the cluster. That seems to have solved the problem.