I'm trying to build a new, 6 node cluster in AWS (not the managed service, just the servers) and I'm getting errors around the nodes not being able to connect to each other.
All instances are in the same subnet, in the same VPC, managed by the same account. Security groups have tcp:9200-9400 for 0.0.0.0/0 ingress/egress (testing... not going to prod like this).
If I run curl -i http://localhost:9200
on the instance i get:
{
"name" : "01.01.01.001", << not the actual IP but the value that's there is the IP of the instance.
"cluster_name" : "elasticsearch",
"cluster_uuid" : "_na_",
"version" : {
"number" : "7.4.2",
"build_flavor" : "default",
"build_type" : "rpm",
"build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
"build_date" : "2019-10-28T20:40:44.881551Z",
"build_snapshot" : false,
"lucene_version" : "8.2.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
if i run curl -i http://localhost:9200/_cluster/health?pretty
i get:
{
"error" : {
"root_cause" : [
{
"type" : "master_not_discovered_exception",
"reason" : null
}
],
"type" : "master_not_discovered_exception",
"reason" : null
},
"status" : 503
}
if i telnet from one host instance to another i get:
[root@ip-xx-xx-xx-xxx ec2-user]# telnet xx.xx.xx.xxx 22
Trying xx.xx.xx.xxx...
Connected to xx.xx.xx.xxx.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4
[root@ip-xx-xx-xx-xxx ec2-user]# telnet xx-xx-xx-xxx 9200
Trying xx-xx-xx-xxx...
telnet: connect to address xx-xx-xx-xxx: Connection refused
if i run netstat on the instance:
netstat -lan | grep 9200
tcp6 0 0 127.0.0.1:9200 :::* LISTEN
tcp6 0 0 ::1:9200 :::* LISTEN
elasticsearch.yml
cluster.name: elasticsearch
node.name: _ec2_ip_address_
node.master: true
node.data: true
node.ingest: false
#################################### Paths ####################################
# Path to directory containing configuration (this file and logging.yml):
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
logger.org.elasticsearch.discovery: TRACE
bootstrap.memory_lock: true
discovery.seed_providers: ec2
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
discovery.ec2.tag.Cluster: elasticsearch
error:
[2019-12-03T19:31:36,573][WARN ][o.e.c.c.ClusterFormationFailureHelper] [10.x.0.xx] master not discovered yet: have discovered [{10.x.0.xx}{RrYjUyCORh-TwhfrrZK6yw}{nNUd0DN2RF60R4kl6XghOw}{127.0.0.1}{127.0.0.1:9300}{l}{ml.machine_memory=8054112256, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.x.0.1x:9300, 10.x.0.2x:9300, 10.x.0.xx:9300, 10.x.0.4x:9300, 10.x.0.5x:9300] from hosts providers and [] from last-known cluster state; node term 0, last-accepted version 0 in term 0
[2019-12-03T19:31:36,986][DEBUG][o.e.d.PeerFinder ] [10.x.0.xx] Peer{transportAddress=127.0.0.1:9301, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9301] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:976) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.4.2.jar:7.4.2]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2159) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:820) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336) ~[?:?]
...
[2019-12-03T19:35:57,065][TRACE][o.e.d.HandshakingTransportAddressConnector] [10.x.0.1x] [connectToRemoteMasterNode[[::1]:9301]] opening probe connection
[2019-12-03T19:35:57,063][DEBUG][o.e.d.PeerFinder ] [10.x.0.1x] Peer{transportAddress=10.x.0.1x:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [][10.x.0.1x:9300] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:976) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.4.2.jar:7.4.2]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2159) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.x.0.1x:9300
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:820) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336) ~[?:?]
so it appears elastic can't connect on 9300 (to itself or to other nodes in the cluster) but I've reviewed the security groups a few times and nothing stands out as being off. Ports are open, and i even added a self-to-self rule just to be sure. All instances are confirmed to have the proper security group.
I do enable BBR TCP congestion control in the instance user_data:
# Enable BBR TCP congestion control
echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf
# reload sysctl to enable BBR
sysctl -p
Elasticsearch Version: 7.4.2
Kernel: 4.14.152-127.182.amzn2.x86_64
JAVA_VERSION="13.0.1"
JAVA_VERSION_DATE="2019-10-15"