Can't find master nodes after node restart

After upgrading to 7.7.1, our data nodes can't find the master nodes after a restart of the data node.
I am using the EC2 discovery plugin. On the initial startup the node joins the cluster as expected, but once restarted it says it can't discover the master nodes even though they are listed in the list of IPs:

[2020-08-16T10:51:33,380][WARN ][o.e.d.HandshakingTransportAddressConnector] [ip-172-30-1-7.ec2.internal] handshake failed for [connectToRemoteMasterNode[172.30.2.153:9300]]
org.elasticsearch.transport.SendRequestTransportException: [][172.30.2.153:9300][internal:transport/handshake]
        at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:719) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.sendWithUser(SecurityServerTransportInterceptor.java:162) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.access$300(SecurityServerTransportInterceptor.java:53) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.lambda$sendRequest$0(SecurityServerTransportInterceptor.java:114) ~[?:?]
        at org.elasticsearch.xpack.core.security.SecurityContext.executeAsUser(SecurityContext.java:127) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.sendRequest(SecurityServerTransportInterceptor.java:114) ~[?:?]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:621) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.transport.TransportService.handshake(TransportService.java:458) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.transport.TransportService.handshake(TransportService.java:436) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.onResponse(HandshakingTransportAddressConnector.java:95) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.onResponse(HandshakingTransportAddressConnector.java:88) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:98) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.1.jar:7.7.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.node.NodeClosedException: node closed {ip-172-30-1-7.ec2.internal}{E2NEDQZWQ9Kf6G0pquFvfw}{KIjua5LDT-KpwNEpHt0bYg}{172.30.1.7}{172.30.1.7:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=66715250688, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
        at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:701) ~[elasticsearch-7.7.1.jar:7.7.1]
        ... 17 more

It then gets stuck in this cycle:

[2020-08-16T10:52:09,845][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ip-172-30-1-7.ec2.internal] master not discovered yet: have discovered [{ip-172-30-1-7.ec2.internal}{E2NEDQZWQ9Kf6G0pquFvfw}{jj2zLzYHR8GU-JiZk2FScw}{172.17.0.1}{172.17.0.1:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=66715250688, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {ip-172-30-0-123.ec2.internal}{vPn-iGVfQdWH_PCebTHUiQ}{a_Zb72LGSdiT92cjpwPaeg}{172.30.0.123}{172.30.0.123:9300}{lm}{aws_availability_zone=us-east-1a, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-2-62.ec2.internal}{sMuj5JCHQNKcNfUZdovHfA}{Sj6ihWVYRh6gpDx9yvXwhw}{172.30.2.62}{172.30.2.62:9300}{lm}{aws_availability_zone=us-east-1c, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-1-9.ec2.internal}{I7bTPetXQxmLz_a7maEWFw}{EGbpYlR7TvucszK5IaauYw}{172.30.1.9}{172.30.1.9:9300}{lm}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-2-153.ec2.internal}{5M30QMotQhWEER6NEm_wiw}{NdlEK_tURk6VzwPdUnKSOQ}{172.30.2.153}{172.30.2.153:9300}{lm}{aws_availability_zone=us-east-1c, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-0-170.ec2.internal}{3uWMMAZRRO2e_0aOYpELHA}{OCck7fwqSXuZ6kC4amioZw}{172.30.0.170}{172.30.0.170:9300}{lmr}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 172.30.0.123:9300, 172.30.2.62:9300, 172.30.1.9:9300, 172.30.2.153:9300, 172.30.2.214:9300, 172.30.2.243:9300, 172.30.0.4:9300, 172.30.0.170:9300] from hosts providers and [] from last-known cluster state; node term 83, last-accepted version 1086389 in term 83
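
For reference, the discovery-related part of my elasticsearch.yml looks roughly like this (a sketch; the endpoint matches the region shown in the logs):

discovery.seed_providers: ec2
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com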

Any help will be greatly appreciated

Are you sure the currently-elected master node appears in the list of discovered nodes? Which one is it?

If so, I think there will be some other log messages too indicating that it is failing to join it for some reason. They might not be very frequent. Can you check for them too?

There's the connection error from the first log I attached.

Also, after setting the log level to debug, this is the output (note all the master nodes are listed there - using dynamic transport addresses [172.30.0.123:9300, 172.30.2.62:9300, 172.30.1.9:9300, 172.30.2.153:9300, 172.30.2.214:9300, 172.30.2.243:9300, 172.30.0.4:9300, 172.30.0.170:9300]):

[2020-08-16T11:01:44,644][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] [ip-172-30-1-7.ec2.internal] Connection [id: 0][route: {s}->https://ec2.us-east-1.amazonaws.com:443] can be kept alive for 60.0 seconds
[2020-08-16T11:01:44,644][DEBUG][o.a.h.i.c.DefaultManagedHttpClientConnection] [ip-172-30-1-7.ec2.internal] http-outgoing-0: set socket timeout to 0
[2020-08-16T11:01:44,644][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] [ip-172-30-1-7.ec2.internal] Connection released: [id: 0][route: {s}->https://ec2.us-east-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
[2020-08-16T11:01:44,647][DEBUG][o.e.d.e.AwsEc2SeedHostsProvider] [ip-172-30-1-7.ec2.internal] using dynamic transport addresses [172.30.0.123:9300, 172.30.2.62:9300, 172.30.1.9:9300, 172.30.2.153:9300, 172.30.2.214:9300, 172.30.2.243:9300, 172.30.0.4:9300, 172.30.0.170:9300]
[2020-08-16T11:01:44,681][DEBUG][o.e.d.PeerFinder ] [ip-172-30-1-7.ec2.internal] Peer{transportAddress=[::1]:9301, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [[::1]:9301] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:998) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2152) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /[0:0:0:0:0:0:0:1]:9301
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:589) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:839) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
... 7 more
[2020-08-16T11:01:44,681][DEBUG][o.e.d.PeerFinder ] [ip-172-30-1-7.ec2.internal] Peer{transportAddress=[::1]:9304, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [[::1]:9304] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:998) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2152) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /[0:0:0:0:0:0:0:1]:9304
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:589) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:839) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
... 7 more
[2020-08-16T11:01:44,682][DEBUG][o.e.d.PeerFinder ] [ip-172-30-1-7.ec2.internal] Peer{transportAddress=127.0.0.1:9303, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [127.0.0.1:9303] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:998) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2152) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9303
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:589) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:839) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
... 7 more

Nope, that's not helpful at all; it's just saying the node can't connect to anything at nonstandard ports like 127.0.0.1:9303, which is hardly surprising. I don't think you need DEBUG logs here.

Similarly, the NodeClosedException from the first message is also uninformative; it just means the node was shutting down while the connection was being made.

Go back to the default logging config to get rid of all this junk and look for messages about problems joining the cluster.
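
Something like this should surface them once logging is back to normal (adjust the path to wherever your logs live):

grep -i 'failed to join' /var/log/elasticsearch/*.log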

Also, repeating my first question:

What does GET _cat/master return?

  1. When the node is in the cluster and everything works fine:

curl localhost:9200/_cat/master
3uWMMAZRRO2e_0aOYpELHA 172.30.0.170 172.30.0.170 ip-172-30-0-170.ec2.internal

  2. Restarted Elasticsearch:

systemctl restart elasticsearch

  3. Elasticsearch running. Logs:
    elasticsearch log · GitHub

Something looks very wrong with your logging config -- there are no INFO messages, for instance. You'll need to see those.

Can we see the logs (including INFO logs) from the master too?

You were right, David; I misconfigured the logs.
You can see the data + master logs here:

Ah, that's better. Join failures are only logged at INFO level since they're not always a problem, but here they are.

[2020-08-17T08:08:53,617][INFO ][o.e.c.c.JoinHelper       ] [ip-172-30-1-61.ec2.internal] failed to join {ip-172-30-0-170.ec2.internal}{3uWMMAZRRO2e_0aOYpELHA}{OCck7fwqSXuZ6kC4amioZw}{172.30.0.170}{172.30.0.170:9300}{lmr}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={ip-172-30-1-61.ec2.internal}{heumpdUlSRezT69xwljSgg}{3NbMrFIKTRylx4c9z1xnQg}{172.17.0.1}{172.17.0.1:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=67534430208, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=83, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [ip-172-30-0-170.ec2.internal][172.30.0.170:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.transport.ConnectTransportException: [ip-172-30-1-61.ec2.internal][172.17.0.1:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1004) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.1.jar:7.7.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

The master node ip-172-30-0-170.ec2.internal cannot connect to the data node ip-172-30-1-61.ec2.internal -- the data node claims its address is 172.17.0.1:9300 (typically a Docker bridge address), which doesn't match the node name.
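
You can check what address a node is publishing with the nodes info API; one possible invocation (filter_path is just to trim the output):

curl -s 'localhost:9200/_nodes/_local?filter_path=nodes.*.transport.publish_address'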

Thanks David.
It incorrectly published the Docker bridge IP. Shouldn't it pick the eth0 interface by default instead of docker0?

I added network.publish_host: <<private_ip>> to the elasticsearch.yml to fix it.
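
Concretely, the line I added looks like this (the address here is just one node's private IP as an example):

network.publish_host: 172.30.1.7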

Are you running these nodes within Docker using a bridge network? The Docker docs say not to do that:

Bridge networks apply to containers running on the same Docker daemon host. For communication among containers running on different Docker daemon hosts, you can either manage routing at the OS level, or you can use an overlay network.

You can use the special value _ec2:privateIpv4_ instead of having to work out what the IP address is yourself.

No, the nodes are not running in containers, which is why I was puzzled that it picked this network 🙂
I do have Docker running, but it is only running a container with the Elasticsearch exporter for Prometheus.

Gotcha. In that case you don't want to set network.publish_host at all; set network.host: _ec2:privateIpv4_ instead.
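
In other words, roughly this in elasticsearch.yml:

network.host: _ec2:privateIpv4_

network.host sets both the bind and publish addresses from the instance's private IPv4, so there's nothing to hard-code per node.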

Noted. Thanks a lot
