[ERROR] Elastic 7.4.2 connection errors in AWS failing to join cluster

I'm trying to build a new, 6 node cluster in AWS (not the managed service, just the servers) and I'm getting errors around the nodes not being able to connect to each other.

All instances are in the same subnet, in the same VPC, managed by the same account. Security groups have tcp:9200-9400 for 0.0.0.0/0 ingress/egress (testing... not going to prod like this).

If I run curl -i http://localhost:9200on the instance i get:

{
  "name" : "01.01.01.001", << not the actual IP but the value that's there is the IP of the instance.
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

if i run curl -i http://localhost:9200/_cluster/health?pretty i get:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

if i telnet from one host instance to another i get:

[root@ip-xx-xx-xx-xxx ec2-user]# telnet xx.xx.xx.xxx 22
Trying  xx.xx.xx.xxx...
Connected to  xx.xx.xx.xxx.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4

[root@ip-xx-xx-xx-xxx ec2-user]# telnet xx-xx-xx-xxx 9200
Trying xx-xx-xx-xxx...
telnet: connect to address xx-xx-xx-xxx: Connection refused

if i run netstat on the instance:

netstat -lan | grep 9200
tcp6       0      0 127.0.0.1:9200          :::*                    LISTEN     
tcp6       0      0 ::1:9200                :::*                    LISTEN  

elasticsearch.yml


cluster.name: elasticsearch
node.name: _ec2_ip_address_

node.master: true
node.data: true
node.ingest: false

#################################### Paths ####################################

# Path to directory containing configuration (this file and logging.yml):

path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

logger.org.elasticsearch.discovery: TRACE

bootstrap.memory_lock: true

discovery.seed_providers: ec2
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
discovery.ec2.tag.Cluster: elasticsearch

error:

[2019-12-03T19:31:36,573][WARN ][o.e.c.c.ClusterFormationFailureHelper] [10.x.0.xx] master not discovered yet: have discovered [{10.x.0.xx}{RrYjUyCORh-TwhfrrZK6yw}{nNUd0DN2RF60R4kl6XghOw}{127.0.0.1}{127.0.0.1:9300}{l}{ml.machine_memory=8054112256, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.x.0.1x:9300, 10.x.0.2x:9300, 10.x.0.xx:9300, 10.x.0.4x:9300, 10.x.0.5x:9300] from hosts providers and [] from last-known cluster state; node term 0, last-accepted version 0 in term 0

[2019-12-03T19:31:36,986][DEBUG][o.e.d.PeerFinder         ] [10.x.0.xx] Peer{transportAddress=127.0.0.1:9301, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9301] connect_exception
        at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:976) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.4.2.jar:7.4.2]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2159) ~[?:?]
        at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:820) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336) ~[?:?]
...
[2019-12-03T19:35:57,065][TRACE][o.e.d.HandshakingTransportAddressConnector] [10.x.0.1x] [connectToRemoteMasterNode[[::1]:9301]] opening probe connection
[2019-12-03T19:35:57,063][DEBUG][o.e.d.PeerFinder         ] [10.x.0.1x] Peer{transportAddress=10.x.0.1x:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [][10.x.0.1x:9300] connect_exception
        at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:976) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-7.4.2.jar:7.4.2]
        at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.4.2.jar:7.4.2]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2159) ~[?:?]
        at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.x.0.1x:9300
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:820) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336) ~[?:?]

so it appears elastic can't connect on 9300 (to itself or to other nodes in the cluster) but I've reviewed the security groups a few times and nothing stands out as being off. Ports are open, and i even added a self-to-self rule just to be sure. All instances are confirmed to have the proper security group.

I do enable BBR TCP congestion control in the instance user_data:

# Enable BBR TCP congestion control
echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf

# reload sysctl to enable BBR
sysctl -p 

Elasticsearch Version: 7.4.2
Kernel: 4.14.152-127.182.amzn2.x86_64
JAVA_VERSION="13.0.1"
JAVA_VERSION_DATE="2019-10-15"

To confirm the security groups are valid, I installed an apache webserver on one of the instances in the cluster and ran it on ports 9200 and 9300 and was able to telnet it from another instance in the cluster/security group on those ports.

So the issue appears to be the way elasticsearch is binding to the host network... or something.

In hindsight, not reading the elasticsearch.yml docs more carefully was a poor choice on my part.

The issue was that I was missing: network.host: 0.0.0.0 in my elasticsearch.yml

Hope this ends up helping someone else with a similar issue, at least.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.