Cluster not forming

I’m trying to get an ES cluster working with CloudFormation. I have it all up and running, but the cluster is not forming correctly.

This is the error that I’m getting:

{"@timestamp":"2025-02-14T22:02:57.598Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/127.0.0.1:37302, remoteAddress=es02.elasticsearch.local/127.255.0.2:9300, profile=default}], disconnecting from relevant node: Connection reset", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es01][transport_worker][T#1]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.node.name":"es01","elasticsearch.cluster.name":"docker-cluster"}

This is the template that I am using. You just need to provide the PrivateSubnetIds, PublicSubnetIds and ECSTaskExecutionRoleArn

Any ideas?

I feel like the cluster name setting is missing.

The container defaults to "docker-cluster". I overrode this with "cluster01" just to test it out, and it didn't fix the issue.

I know nothing about CloudFormation, but the Elastic docs say:

You must set cluster.initial_master_nodes to the same list of nodes on each node on which it is set in order to be sure that only a single cluster forms during bootstrapping. If cluster.initial_master_nodes varies across the nodes on which it is set then you may bootstrap multiple clusters.

In your config, this setting varies.
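A quick way to spot this kind of mismatch is to compare the setting across nodes. Here is a minimal sketch in Python, assuming you can copy each node's environment into a dict; the function names and the sample values are hypothetical, and treating the list as order-insensitive is my assumption:

```python
from typing import Dict, List, Set, Tuple

SETTING = "cluster.initial_master_nodes"


def parse_node_list(value: str) -> List[str]:
    """Split a comma-separated setting value into sorted, trimmed node names."""
    return sorted(name.strip() for name in value.split(",") if name.strip())


def distinct_initial_master_lists(
    configs: Dict[str, Dict[str, str]],
) -> Set[Tuple[str, ...]]:
    """Collect every distinct initial-master list found across the nodes.

    Nodes that do not set the value are skipped, since the docs only
    require agreement among the nodes on which it is set.
    """
    seen: Set[Tuple[str, ...]] = set()
    for env in configs.values():
        value = env.get(SETTING)
        if value is not None:
            seen.add(tuple(parse_node_list(value)))
    return seen


# Hypothetical per-node environments, for illustration only.
varying = {
    "es01": {SETTING: "es01"},
    "es02": {SETTING: "es02"},
    "es03": {SETTING: "es03"},
}
consistent = {
    "es01": {SETTING: "es01,es02,es03"},
    "es02": {SETTING: "es01,es02,es03"},
    "es03": {SETTING: "es01,es02,es03"},
}
```

More than one distinct list in the result means the nodes disagree and could bootstrap separate clusters.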

Also, after your cluster fails to come up properly, do all the nodes stay up? If so, can you log in and do some troubleshooting from there? I.e. do any (how many?) of the 3 nodes think they are part of a cluster, and if so, what do they think the cluster is composed of?

I fixed it and tried with just one node; I get the same result. The nodes stay up and running, but they just log the connection errors. I confirmed that they can communicate with each other:

nc -zv es02.elasticsearch.local 9200
Connection to es02.elasticsearch.local 9200 port [tcp/] succeeded!
nc -zv es02.elasticsearch.local 9300
Connection to es02.elasticsearch.local 9300 port [tcp/] succeeded!

  ContainerDefinitions:
    - Name: es01
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      Image: !Ref ImageUrl
      PortMappings:
        - ContainerPort: !Ref ContainerPort
          HostPort: !Ref ContainerPort
          Protocol: tcp
          Name: "api"
      LogConfiguration:
        LogDriver: awslogs
        Options:
          mode: non-blocking
          max-buffer-size: 25m
          awslogs-group: !Ref LogGroup
          awslogs-region: !Ref AWS::Region
          awslogs-stream-prefix: es01
      Ulimits:
        - Name: memlock
          SoftLimit: -1
          HardLimit: -1
      Environment:
        - Name: discovery.type
          Value: multi-node
        - Name: cluster.name
          value: cluster01
        - Name: node.name
          Value: "es01"
        - Name: cluster.initial_master_nodes
          Value: "es01"
        - Name: discovery.seed_hosts
          Value: "es01.elasticsearch.local"
        - Name: discovery.cluster_formation_warning_timeout
          Value: "10m"
        - Name: ES_JAVA_OPTS
          Value: "-Xms6g -Xmx6g"
        - Name: xpack.security.enabled
          Value: false
        - Name: xpack.security.transport.ssl.enabled
          Value: false
        - Name: xpack.security.http.ssl.enabled
          Value: false
        - Name: xpack.security.authc.api_key.enabled
          Value: false
        - Name: xpack.security.authc.realms.native.native1.enabled
          Value: true
        - Name: xpack.security.authc.realms.native.native1.order
          Value: 0
        - Name: action.destructive_requires_name
          Value: false

When you start Elasticsearch on one of the nodes, it should spit out all kinds of startup logs, either to the console or to a log file. Share them here.

I had meant to use, eg, curl and do

curl -X GET -s -k http://localhost:9200/_cluster/health

(or the local IP address)

on all 3 hosts.
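If curl isn't available inside the containers, the same check can be sketched in Python with only the standard library. This is a rough sketch, assuming plain HTTP on port 9200 with security disabled, as in the template; the function names are hypothetical, while `cluster_name`, `status` and `number_of_nodes` are fields of the `_cluster/health` response:

```python
import json
import urllib.request
from typing import Any, Dict


def fetch_health(host: str, port: int = 9200, timeout: float = 5.0) -> Dict[str, Any]:
    """GET /_cluster/health from one node and return the parsed JSON body."""
    url = f"http://{host}:{port}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(host: str, health: Dict[str, Any]) -> str:
    """Condense a health response into one line for side-by-side comparison."""
    return (
        f"{host}: cluster={health.get('cluster_name')} "
        f"status={health.get('status')} "
        f"nodes={health.get('number_of_nodes')}"
    )
```

Calling `summarize(host, fetch_health(host))` for each of the three hosts shows at a glance whether each node thinks it is alone or part of a larger cluster. A node that has not joined or formed a cluster may answer with an HTTP error instead, which `urlopen` raises as an exception.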

But in the end, for me anyway, you have a CloudFormation issue (I don't really understand the syntax or how it is used). For my 3-node Docker cluster, I have the following in the compose file:

    environment:
      - node.name=es01
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=es01,es02,es03
      - discovery.seed_hosts=es02,es03

    environment:
      - node.name=es02
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=es01,es02,es03
      - discovery.seed_hosts=es01,es03

    environment:
      - node.name=es03
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=es01,es02,es03
      - discovery.seed_hosts=es01,es02

A couple of little things:

AWSTemplateFormatVersion: 2010-09-09

Not important, but some 15 years have passed since then. I thought AWS moved things along faster than that ...

2nd, why use 8.15.1? This reads to me like a new installation, so why not use 8.17.2, aka latest?

3rd, no security, no SSL, in-the-clear HTTP, ...? If it's just a POC then ... but so many times I've seen a POC become production in a heartbeat.

            - Name: node.name
              Value: "es01"
            - Name: cluster.name
              value: cluster01
            - Name: cluster.initial_master_nodes
              Value: "es01,es02,es03"
            - Name: discovery.seed_hosts
              Value: "es02.elasticsearch.local,es03.elasticsearch.local"

            - Name: node.name
              Value: "es02"
            - Name: cluster.name
              value: cluster01
            - Name: cluster.initial_master_nodes
              Value: "es01,es02,es03"
            - Name: discovery.seed_hosts
              Value: "es01.elasticsearch.local,es03.elasticsearch.local"

            - Name: cluster.name
              value: cluster01
            - Name: node.name
              Value: "es03"
            - Name: cluster.initial_master_nodes
              Value: "es01,es02,es03"
            - Name: discovery.seed_hosts
              Value: "es01.elasticsearch.local,es02.elasticsearch.local"

My config is similar. The containers run on different servers, so they need hostnames that ES can resolve to IPs.

I will add the extra security stuff if I get this working. I'm trying to reduce other potential issues. Thanks for the input.

I can't attach the logs here, but they are in the SB link earlier. (The forum flagged it as spam, so I can't link it again.)

The nodes are on different EC2 instances. I'm going to try the network.host setting.

Actually, you had different combinations.

For the logs, elasticsearch is verbose when starting up first time. Share those logs please.

I did. Can you access them?

If you mean the post that shows as “This post was flagged by the community and is temporarily hidden.”, then no.

Double check your logs don’t contain anything dodgy, and try again. Or put on pastebin or similar.

The logs are there in the link from the OP, and say stuff like this:

1739909923928,"{""@timestamp"":""2025-02-18T20:18:43.927Z"", ""log.level"": ""WARN"", ""message"":""address [127.255.0.3:9300], node [unknown discovery result: [][127.255.0.3:9300] general node connection failure: handshake failed because connection reset; for summary, see logs from org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.15/discovery-troubleshooting.html"", ""ecs.version"": ""1.2.0"",""service.name"":""ES_ECS"",""event.dataset"":""elasticsearch.server"",""process.thread.name"":""elasticsearch[es02][generic][T#2]"",""log.logger"":""org.elasticsearch.discovery.PeerFinder"",""elasticsearch.node.name"":""es02"",""elasticsearch.cluster.name"":""cluster01""}"
1739909924927,"{""@timestamp"":""2025-02-18T20:18:44.927Z"", ""log.level"": ""INFO"", ""message"":""close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/127.0.0.1:55376, remoteAddress=es03.elasticsearch.local/127.255.0.3:9300, profile=default}], disconnecting from relevant node: Connection reset"", ""ecs.version"": ""1.2.0"",""service.name"":""ES_ECS"",""event.dataset"":""elasticsearch.server"",""process.thread.name"":""elasticsearch[es02][transport_worker][T#2]"",""log.logger"":""org.elasticsearch.transport.TcpTransport"",""elasticsearch.node.name"":""es02"",""elasticsearch.cluster.name"":""cluster01""}"

Connection reset means that ES opened a connection and then something outside of ES forced it to close with a RST packet. Quite possibly the RST happens in reaction to some data being sent between the nodes. That is consistent with nc -zv es02.elasticsearch.local 9300 reporting success because this command sends no data, so it's not getting as far as the problematic step.

You need to look at your network infra to find out what is sending those RST packets to abort these inter-node connections.
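The difference between a connect-only check like nc -z and one that actually sends data can be illustrated with a small probe. This is only a sketch: the function name and return strings are hypothetical, and the default payload is arbitrary bytes, not a valid Elasticsearch transport message, so a healthy node may itself close the connection on receiving it. Treat the result as a hint, not a diagnosis; a packet capture is still what identifies the source of the RST.

```python
import socket


def probe(host: str, port: int, payload: bytes = b"ping\n", timeout: float = 3.0) -> str:
    """Connect, send a few bytes, and report how the other side reacts.

    Unlike a bare connect, this pushes data through the path, so anything
    that resets the connection in reaction to traffic has a chance to show up.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            try:
                data = sock.recv(1024)
            except socket.timeout:
                return "open (no reply before timeout)"
            # An empty read means an orderly close (FIN) rather than a reset.
            return "closed by peer" if data == b"" else "replied"
    except ConnectionResetError:
        return "reset"
    except OSError as exc:
        return f"failed: {exc}"
```

Running `probe("es02.elasticsearch.local", 9300)` from inside the es01 container and getting "reset", while nc -zv reports success, would be consistent with the explanation above.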

Now yes, last night no, blocked for some (likely spurious) reason.

Noting these logs were from some days ago, but ...

es01 gets to the point where it logs

{"@timestamp":"2025-02-18T20:54:51.069Z","log.level":"INFO","message":"publish_address {172.30.5.223:9300}, bound_addresses {0.0.0.0:9300}","ecs.version":"1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.transport.TransportService","elasticsearch.node.name":"es01","elasticsearch.cluster.name":"cluster01"}

Notice the publish address is 172.30.5.223, on a private network.

But the other addresses in the 3 log files are

127.0.0.1
and
127.255.0.1/.2/.3 which appear to be what es01/2/3 are resolving to?

e.g. from logs like

localAddress=/127.0.0.1:57402
remoteAddress=es03.elasticsearch.local/127.255.0.3:9300

Now I'm not a network specialist, but should traffic to 127.x.y.z even leave the host at all?

On my Linux host:

$ sudo lsof -i :9200
COMMAND   PID          USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
java    18591 elasticsearch  557u  IPv6 1300989      0t0  TCP *:9200 (LISTEN)

$ nc -rz 127.0.0.1 9200
Connection to 127.0.0.1 9200 port [tcp/*] succeeded!

$ nc -rz 127.23.53.2 9200
Connection to 127.23.53.2 9200 port [tcp/*] succeeded!

All those "connections" are internal to the local host. I can change to any 127.x.y.z and I'll get exactly the same result.

I think part of the issue here is that es01/2/3 should probably resolve to addresses like 172.30.5.223, and not 127.x.y.z, on all 3 hosts. Just a guess, of course.
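To check what the names resolve to from inside each container, something like the following Python sketch could be used. The function names are hypothetical, and it only classifies the addresses; it does not say whether a loopback answer is wrong in this particular environment.

```python
import ipaddress
import socket
from typing import List


def is_loopback(address: str) -> bool:
    """True if the address falls in the loopback range (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def resolve(hostname: str) -> List[str]:
    """Return the distinct addresses the local resolver gives for a hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def report(hostnames: List[str]) -> List[str]:
    """Build one line per resolved address, flagging loopback results."""
    lines = []
    for name in hostnames:
        try:
            addresses = resolve(name)
        except socket.gaierror as exc:
            lines.append(f"{name}: resolution failed ({exc})")
            continue
        for address in addresses:
            kind = "loopback" if is_loopback(address) else "non-loopback"
            lines.append(f"{name} -> {address} ({kind})")
    return lines
```

For example, printing the lines from `report(["es01.elasticsearch.local", "es02.elasticsearch.local", "es03.elasticsearch.local"])` on each host would show whether the seed hosts resolve to 127.x.y.z or to the private 172.30.x.x addresses.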

You are right that these are unusual addresses, but IME cloud/container environments can do arbitrarily weird stuff to the network config so it's best not to make any assumptions about such things. The quickest and most reliable way to resolve this will be to break out tcpdump or Wireshark or similar and locate the source of those RST packets. Everything else is just guesswork.

It may also help to read this section of the docs particularly since ES seems to be running in a multi-homed context:

It is usually a mistake to use 0.0.0.0 as a publish address on hosts with more than one network interface.

David, I didn't make any assumptions, I merely (implicitly) asked @Joel to check the name resolutions are what he expects.

And honestly, I think that if they are resolving to 127.x.y.z addresses, that's not right. That isn't an assumption, it's speculation, though RFC 5735 says (other RFCs have similar):

addresses within the entire 127.0.0.0/8 block do not legitimately appear on any network anywhere

Troubleshooting on limited information is hard; it's pretty hard with all the information to hand, too. If we need to take into account "arbitrarily weird stuff" it gets ...