Nodes do not find the initial master node

I'm trying to run the Elasticsearch on AWS ECS. I have a problem at the moment, the nodes try to join the cluster. When the other nodes are in the same instance that the initial master node, they can to join the cluster, but when the other nodes are in a different instance, they are got not join to the cluster. Below the elasticsearch.yml I use.

Initial master node
cluster:
  name: '${STACK_NAME}'
  initial_master_nodes: 
    - '${HOSTNAME}-es-initial-master'
  routing:
    allocation:
      awareness: 
        attributes: aws_availability_zone
node:
  name: '${HOSTNAME}-es-initial-master'
  roles: [master]
path: 
  data: /usr/share/elasticsearch/data
  logs: /usr/share/elasticsearch/logs
bootstrap:
  memory_lock: true
discovery:
  zen: 
    minimum_master_nodes: 2
  ec2:
    endpoint: ec2.us-east-2.amazonaws.com
network:
  host: 0.0.0.0
xpack:
  monitoring:
    collection:
      interval: 10s
      enabled: true
  security:
    enabled: true
cloud:
  node:
    auto_attributes: true
http:
  cors:
    enabled: true
    allow-origin: "*"
ingest:
  geoip:
    downloader:
      enabled: false


Data node
cluster:
  name: '${STACK_NAME}'
node:
  name: '${HOSTNAME}-es-data-${SUFFIX}'
  roles: ["data"]
path: 
  data: /usr/share/elasticsearch/data
  logs: /usr/share/elasticsearch/logs
bootstrap:
  memory_lock: true
discovery:
  seed_providers: ec2
  ec2:
    endpoint: ec2.us-east-2.amazonaws.com
  seed_hosts: []
s3:
  client:
    default:
      endpoint: s3.us-east-2.amazonaws.com
network:
  host: 0.0.0.0
xpack:
  monitoring:
    collection:
      interval: 10s
      enabled: true
  security:
    enabled: true
cloud:
  node:
    auto_attributes: true
http:
  cors:
    enabled: true
    allow-origin: "*"
ingest:
  geoip:
    downloader:
      enabled: false

If someone can help me, I'd appreciate it a lot!!

See these docs for guidance about how to troubleshoot discovery problems, including the things to look for in logs etc. If you need help understanding your logs, please share them here.

Hi David. Thanks for the docs, they will be very helpful. I found this while looking at the logs: the node finds the eligible master, but can't complete the connection.

Yep that'd do it - these docs are what you need here.

Thanks a lot, David! I'll read this now!

If I configure network.publish_host and network.bind_host with the value 0.0.0.0, shouldn't this resolve my problem?

0.0.0.0 can be fairly trappy for network.publish_host, especially if there's some kind of proxying or NAT going on as appears to be the case in your environment. I'd recommend being more specific. The log message you shared indicates that one possible step towards a resolution would be to specify network.publish_host: 172.30.5.137 on node deccdfd16c64-es-initial-master.
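
Something along these lines (just a sketch, substitute whatever address is actually routable from the other instances):

network:
  bind_host: 0.0.0.0
  publish_host: 172.30.5.137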

Thanks, @DavidTurner!! I changed network.publish_host: 0.0.0.0 to network.publish_host: _ec2_. Now I'm getting a new error.
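
For reference, here is the relevant part of the config (as I understand it, _ec2_ comes from the discovery-ec2 plugin and resolves to the instance's private IPv4 address, the same as _ec2:privateIpv4_):

network:
  host: 0.0.0.0
  publish_host: _ec2_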

Would you copy the text of the errors (formatted with the </> button) rather than screenshots? Screenshots are pretty much unreadable here. And include the whole message, not just the few lines you screenshotted, because there's important detail missing here.

Sorry for the delay, here are the logs.

{"type": "server", "timestamp": "2024-07-05T18:48:15,080Z", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "dev-es", "node.name": "es-kibana", "message": "failed to join {es-initial-master}{0uGjtOVrQ2qUspiQo9tMLw}{zzUo1K6cRwmdJpkuzBNBAw}{172.30.4.30}{172.30.4.30:9300}{m}{aws_availability_zone=us-east-2b, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={es-kibana}{2Slv2ZZdRQa1yNV7NPazyw}{XsiGsq4-QVqaMl44QKp8ig}{172.17.0.8}{172.17.0.8:9300}{dir}{aws_availability_zone=us-east-2a, xpack.installed=true, transform.node=false}, minimumTerm=1, optionalJoin=Optional[Join{term=1, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={es-kibana}{2Slv2ZZdRQa1yNV7NPazyw}{XsiGsq4-QVqaMl44QKp8ig}{172.17.0.8}{172.17.0.8:9300}{dir}{aws_availability_zone=us-east-2a, xpack.installed=true, transform.node=false}, targetNode={es-initial-master}{0uGjtOVrQ2qUspiQo9tMLw}{zzUo1K6cRwmdJpkuzBNBAw}{172.30.4.30}{172.30.4.30:9300}{m}{aws_availability_zone=us-east-2b, xpack.installed=true, transform.node=false}}]}", 
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-initial-master][172.17.0.10:9300][internal:cluster/coordination/join]",
"Caused by: org.elasticsearch.transport.ConnectTransportException: [es-kibana][172.17.0.8:9300] handshake failed. unexpected remote node {es-client-18d27a6938af}{oxS_gycZTI6_T_pjkLeXTw}{WZicVUy9SeibYfn0xkuAqg}{172.17.0.8}{172.17.0.8:9300}{r}{aws_availability_zone=us-east-2b, xpack.installed=true, transform.node=false}",
"at org.elasticsearch.transport.TransportService.lambda$connectionValidator$6(TransportService.java:468) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:95) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.TransportService.lambda$handshake$9(TransportService.java:577) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:352) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:340) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) ~[elasticsearch-7.17.15.jar:7.17.15]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]",
"at java.lang.Thread.run(Thread.java:1583) [?:?]"] }

It seems that you have two nodes that both claim to be at 172.17.0.8:9300; that's not going to work. Every node needs its own address.
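
One way to do that on ECS (a sketch, assuming you can inject each task's routable IP into the container, here via a hypothetical ES_PUBLISH_HOST environment variable) would be to give every node its own publish address while still binding to all interfaces:

network:
  bind_host: 0.0.0.0
  publish_host: '${ES_PUBLISH_HOST}'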

I opened Clarify logs/errors re. publish addresses by DaveCTurner · Pull Request #110570 · elastic/elasticsearch · GitHub to clarify this point, reword the log messages to be clearer, and add links to the relevant docs.


Hi, @DavidTurner! Thank you so much for your help with the troubleshooting. It's working now; I'll document the solution here to help others who run into this problem.
