Multi node cluster failing to connect

vanwoes · March 1, 2023, 10:54am

Hi,

I'm having an issue with a multi-node Elasticsearch cluster in Docker Swarm: the nodes are failing to join the cluster.

received join request from [{es01}{SBn0YXX-RyuPcEsz3vgdjA}{0l0I2h0HRteijUgnwwmvqg}{es01}{10.0.0.69}{10.0.0.69:9300}{hmrs}{xpack.installed=true}] but could not connect back to the joining node

error.message":"[es01][10.0.0.69:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es01][10.0.0.69:9300] connect_timeout[30s]

I can get it working if I don't publish any of the ports on es01, but I need to publish them because external services must be able to connect to this Elasticsearch cluster.

I'm using a Docker Compose file very similar to the one in "Install Elasticsearch with Docker" (Elasticsearch Guide [8.6]), but using version 3.2 instead (not sure if that makes any difference?)

Are there any additional networking configurations that I need to add?

Thank you!

warkolm · March 1, 2023, 11:55pm

Welcome to our community!

It'd help if you provided configs and full logs.

rpd · March 16, 2023, 4:53am

Hello

I am also facing a very similar issue.

My environment is as below: 3 master nodes, 3 ingest nodes, 18 data nodes

Docker Swarm on Debian 11 (Bullseye)
Docker Version: 23.0.1
Elasticsearch 7.17.6

Initially the master nodes were not discovered, although each container could curl the others on ports 9200 and 9300.

After playing with discovery.seed_hosts and network.publish_host and transport.host, I was able to get the master election working.

As soon as the master is elected, I get the error below on all master nodes:

<es_container> | "stacktrace": ["io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms",

In addition, the following stack trace keeps appearing on all nodes:

<es_container> | "stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [<es_master_node>][10.2.252.86:9300] general node connection failure",

The general node connection failure is not specific to master nodes; it occurs on all master, ingest and data nodes.

At the container level I have verified network connectivity.
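For readers who want to repeat this kind of check, here is a minimal sketch of a TCP reachability probe that could be run from inside each container. The function names and the default port list are illustrative assumptions, not something from this thread. Note that a successful TCP connect only shows the port is reachable; it says nothing about whether a TLS handshake on the transport port will complete.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_nodes(hosts, ports=(9200, 9300), timeout=5.0):
    """Map each (host, port) pair to whether a TCP connection succeeded.

    The default ports are the HTTP (9200) and transport (9300) ports
    mentioned in this thread; adjust them if your nodes use other ports.
    """
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}
```

Example use, with hypothetical service names: `check_nodes(["es01", "es02"])`. It is worth testing the exact address each node publishes (the one shown in the join error), not only the service name.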

Could this keep happening until the nodes have fully recovered their indices?

We have pretty heavy indices (about 150 GB each).

The same configuration was working earlier on Docker version 20.x, but I am facing this issue with Docker version 23.0.1 and Elasticsearch 7.17.6.

Any help troubleshooting this would be appreciated. How can I increase the debug output in the Elasticsearch logs?
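On the logging question, one option is to raise a logger's level through the cluster update settings API. The sketch below only builds the request; the base URL, the helper names and the choice of the `org.elasticsearch.transport` logger are assumptions for illustration, and authentication and TLS are not handled. The API needs an elected master, so if the cluster has not formed, the same `logger.org.elasticsearch.transport: DEBUG` line can go in each node's `elasticsearch.yml` instead.

```python
import json
import urllib.request


def logger_settings(logger: str = "org.elasticsearch.transport",
                    level: str = "DEBUG") -> dict:
    """Build a cluster-settings body that sets one logger's level."""
    return {"persistent": {f"logger.{logger}": level}}


def build_request(base_url: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request to the _cluster/settings endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

To send it against an unsecured local node (an assumed address): `urllib.request.urlopen(build_request("http://localhost:9200", logger_settings()))`. DEBUG on the transport logger can be noisy, so it is worth setting it back once the problem is captured.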

warkolm · March 19, 2023, 11:23pm

Please start your own topic for this.

rpd · March 20, 2023, 4:29am

Yes thanks, apologies for hijacking the discussion! I did find a solution, so will post the same soon.

system · April 17, 2023, 4:29am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
