Multi node cluster failing to connect

Hi,

I'm having an issue with a multi-node Elasticsearch cluster running in Docker Swarm: the nodes are failing to join the cluster.

received join request from [{es01}{SBn0YXX-RyuPcEsz3vgdjA}{0l0I2h0HRteijUgnwwmvqg}{es01}{10.0.0.69}{10.0.0.69:9300}{hmrs}{xpack.installed=true}] but could not connect back to the joining node
error.message":"[es01][10.0.0.69:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es01][10.0.0.69:9300] connect_timeout[30s]

I can get it working if I don't publish any of the ports on es01, but I need to publish them because external services must be able to connect to this Elasticsearch cluster.

I'm using a Docker Compose file very similar to the one in "Install Elasticsearch with Docker" (Elasticsearch Guide [8.6]), but using version 3.2 instead (not sure if that makes any difference?).

Are there any additional networking configurations that I need to add?

Thank you!

Welcome to our community! :smiley:

It'd help if you provided configs and full logs.

Hello

I am also facing a very similar issue.

My environment is as below: 3 master nodes, 3 ingest nodes, 18 data nodes

Docker Swarm on Debian 11 (Bullseye)
Docker Version: 23.0.1
Elasticsearch 7.17.6

Initially the master nodes were not discovered, although each container could curl the others on ports 9200 and 9300.

After playing with discovery.seed_hosts, network.publish_host, and transport.host, I was able to get the master election working.

As soon as the master is elected, I get the error below on all master nodes:

<es_container> | "stacktrace": ["io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms",

In addition, the following stack trace keeps appearing on all nodes:

<es_container> | "stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [<es_master_node>][10.2.252.86:9300] general node connection failure",

The general node connection failure is not specific to master nodes; it occurs on all master, ingest, and data nodes.

At the container level I have verified network connectivity.
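A connectivity check of this kind could look like the minimal sketch below. It only tests whether a TCP connection to a node's transport address opens within a timeout; it does not perform a TLS handshake, so it cannot confirm or rule out the SslHandshakeTimeoutException above. The function name can_connect and the 5 second default are hypothetical, and the address in the usage comment is simply the one from the stack trace.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage, run from inside another container on the same overlay network:
#   print(can_connect("10.2.252.86", 9300))
```

Running it against the address a node publishes, rather than the service name, matters here because the other nodes connect back to the published address.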

Could this happen until the nodes recover their indices fully?

We have pretty heavy indices (about 150 GB each).

The same configuration was working earlier on Docker version 20.x, but I am facing this issue with Docker version 23.0.1 and Elasticsearch 7.17.6.

Any help troubleshooting this would be appreciated. How can I increase the debug output in the Elasticsearch logs?
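On the logging question, one documented option is to raise individual logger levels dynamically through the cluster settings API. The sketch below builds such a request in Python; the helper names, the choice of the transport and discovery loggers, and the plain-HTTP URL in the usage comment are assumptions, and a cluster with security enabled would also need HTTPS and credentials.

```python
import json
import urllib.request
from typing import Optional


def build_logger_settings(level: Optional[str] = "DEBUG") -> dict:
    """Build a cluster-settings body for two loggers; level=None resets them."""
    return {
        "persistent": {
            "logger.org.elasticsearch.transport": level,
            "logger.org.elasticsearch.discovery": level,
        }
    }


def build_request(base_url: str, level: Optional[str] = "DEBUG") -> urllib.request.Request:
    """Prepare (but do not send) a PUT request to the _cluster/settings endpoint."""
    body = json.dumps(build_logger_settings(level)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def apply_logger_settings(base_url: str, level: Optional[str] = "DEBUG") -> str:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(build_request(base_url, level), timeout=10) as resp:
        return resp.read().decode("utf-8")


# Example usage (assumes unauthenticated HTTP on localhost):
#   print(apply_logger_settings("http://localhost:9200"))
# Reset afterwards, since DEBUG output is verbose:
#   print(apply_logger_settings("http://localhost:9200", level=None))
```

This needs an elected master to accept the settings update. If the cluster cannot get that far, the same logger keys can be set in elasticsearch.yml instead, followed by a node restart.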

Please start your own topic for this :slight_smile:

Yes, thanks, and apologies for hijacking the discussion! I did find a solution and will post it soon.
