Nodes do not find the initial master node

I'm trying to run the Elasticsearch on AWS ECS. I have a problem at the moment, the nodes try to join the cluster. When the other nodes are in the same instance that the initial master node, they can to join the cluster, but when the other nodes are in a different instance, they are got not join to the cluster. Below the elasticsearch.yml I use.

Initial master node
cluster:
  name: '${STACK_NAME}'
  initial_master_nodes: 
    - '${HOSTNAME}-es-initial-master'
  routing:
    allocation:
      awareness: 
        attributes: aws_availability_zone
node:
  name: '${HOSTNAME}-es-initial-master'
  roles: [master]
path: 
  data: /usr/share/elasticsearch/data
  logs: /usr/share/elasticsearch/logs
bootstrap:
  memory_lock: true
discovery:
  zen: 
    minimum_master_nodes: 2
  ec2:
    endpoint: ec2.us-east-2.amazonaws.com
network:
  host: 0.0.0.0
xpack:
  monitoring:
    collection:
      interval: 10s
      enabled: true
  security:
    enabled: true
cloud:
  node:
    auto_attributes: true
http:
  cors:
    enabled: true
    allow-origin: "*"
ingest:
  geoip:
    downloader:
      enabled: false


Data node
cluster:
  name: '${STACK_NAME}'
node:
  name: '${HOSTNAME}-es-data-${SUFFIX}'
  roles: ["data"]
path: 
  data: /usr/share/elasticsearch/data
  logs: /usr/share/elasticsearch/logs
bootstrap:
  memory_lock: true
discovery:
  seed_providers: ec2
  ec2:
    endpoint: ec2.us-east-2.amazonaws.com
  seed_hosts: []
s3:
  client:
    default:
      endpoint: s3.us-east-2.amazonaws.com
network:
  host: 0.0.0.0
xpack:
  monitoring:
    collection:
      interval: 10s
      enabled: true
  security:
    enabled: true
cloud:
  node:
    auto_attributes: true
http:
  cors:
    enabled: true
    allow-origin: "*"
ingest:
  geoip:
    downloader:
      enabled: false

If someone can help me, I'd appreciate it a lot!!

See these docs for guidance about how to troubleshoot discovery problems, including the things to look for in logs etc. If you need help understanding your logs, please share them here.

Hi David. Thanks for the docs, they will be very helpful. I found this while looking at the logs: the node finds the eligible master, but can't complete the connection.

Yep that'd do it - these docs are what you need here.

Thanks a lot, David! I'll read this now!

If I configure network.publish_host and network.bind_host with the value 0.0.0.0, shouldn't this resolve my problem?

0.0.0.0 can be fairly trappy for network.publish_host, especially if there's some kind of proxying or NAT going on as appears to be the case in your environment. I'd recommend being more specific. The log message you shared indicates that one possible step towards a resolution would be to specify network.publish_host: 172.30.5.137 on node deccdfd16c64-es-initial-master.
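
Something along these lines (just a sketch, substitute whatever address is actually routable from the other instances):

network:
  bind_host: 0.0.0.0
  publish_host: 172.30.5.137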

Thanks, @DavidTurner!! I changed network.publish_host: 0.0.0.0 to network.publish_host: _ec2_. Now I'm getting a new error.
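
For reference, here is the relevant part of the config (as I understand it, _ec2_ comes from the discovery-ec2 plugin and resolves to the instance's private IPv4 address, the same as _ec2:privateIpv4_):

network:
  host: 0.0.0.0
  publish_host: _ec2_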

Would you copy the text of the errors (formatted with the </> button) rather than screenshots? Screenshots are pretty much unreadable here. And include the whole message, not just the few lines you screenshotted, because there's important detail missing here.

Sorry for the delay, here are the logs.

{"type": "server", "timestamp": "2024-07-05T18:48:15,080Z", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "dev-es", "node.name": "es-kibana", "message": "failed to join {es-initial-master}{0uGjtOVrQ2qUspiQo9tMLw}{zzUo1K6cRwmdJpkuzBNBAw}{172.30.4.30}{172.30.4.30:9300}{m}{aws_availability_zone=us-east-2b, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={es-kibana}{2Slv2ZZdRQa1yNV7NPazyw}{XsiGsq4-QVqaMl44QKp8ig}{172.17.0.8}{172.17.0.8:9300}{dir}{aws_availability_zone=us-east-2a, xpack.installed=true, transform.node=false}, minimumTerm=1, optionalJoin=Optional[Join{term=1, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={es-kibana}{2Slv2ZZdRQa1yNV7NPazyw}{XsiGsq4-QVqaMl44QKp8ig}{172.17.0.8}{172.17.0.8:9300}{dir}{aws_availability_zone=us-east-2a, xpack.installed=true, transform.node=false}, targetNode={es-initial-master}{0uGjtOVrQ2qUspiQo9tMLw}{zzUo1K6cRwmdJpkuzBNBAw}{172.30.4.30}{172.30.4.30:9300}{m}{aws_availability_zone=us-east-2b, xpack.installed=true, transform.node=false}}]}", 
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-initial-master][172.17.0.10:9300][internal:cluster/coordination/join]",
"Caused by: org.elasticsearch.transport.ConnectTransportException: [es-kibana][172.17.0.8:9300] handshake failed. unexpected remote node {es-client-18d27a6938af}{oxS_gycZTI6_T_pjkLeXTw}{WZicVUy9SeibYfn0xkuAqg}{172.17.0.8}{172.17.0.8:9300}{r}{aws_availability_zone=us-east-2b, xpack.installed=true, transform.node=false}",
"at org.elasticsearch.transport.TransportService.lambda$connectionValidator$6(TransportService.java:468) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:95) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.TransportService.lambda$handshake$9(TransportService.java:577) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:352) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:340) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]",
"at java.lang.Thread.run(Thread.java:1583) [?:?]"] }

It seems that you have two nodes that both claim to be at 172.17.0.8:9300; that's not going to work. Every node needs its own address.
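
One way to do that on ECS (a sketch, assuming you can inject each task's routable IP into the container, here via a hypothetical ES_PUBLISH_HOST environment variable) would be to give every node its own publish address while still binding to all interfaces:

network:
  bind_host: 0.0.0.0
  publish_host: '${ES_PUBLISH_HOST}'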

I opened Clarify logs/errors re. publish addresses by DaveCTurner · Pull Request #110570 · elastic/elasticsearch · GitHub to clarify this point, reword the log messages to be clearer, and add links to the relevant docs.


Hi, @DavidTurner! Thank you so much for your help with the troubleshooting. It's working now; I'll document the solution here to help others who run into this problem.
