Elasticsearch data node fails to join the cluster

zozo6015 · December 7, 2017, 3:55pm

Hello

I have an Elasticsearch 6.0.0 cluster running in a GCE environment. I want to add a data node, but this node is failing to join the cluster.

I am getting the following in the logs:

[2017-12-07T15:50:42,782][WARN ][o.e.n.Node               ] [elastic-data-4] timed out while waiting for initial discovery state - timeout: 30s
[2017-12-07T15:50:42,807][INFO ][o.e.h.n.Netty4HttpServerTransport] [elastic-data-4] publish_address {10.0.2.10:9200}, bound_addresses {[::]:9200}
[2017-12-07T15:50:42,807][INFO ][o.e.n.Node               ] [elastic-data-4] started
[2017-12-07T15:51:15,953][INFO ][o.e.d.z.ZenDiscovery     ] [elastic-data-4] failed to send join request to master [{elastic-master-2}{3MYOSUphRnSa0dGIB8cjFQ}{lfaoDLVQSlKRZPUumEYclA}{10.0.2.3}{10.0.2.3:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]

The node can curl every other node on ports 9200 and 9300, so it does not look like a network issue.

Any suggestion would be helpful.
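The reachability check described above (curl against ports 9200 and 9300) can also be scripted. The following is a minimal Python sketch, not part of the original setup: `can_connect` is a hypothetical helper name, and the hosts and ports passed to it would be the ones from this cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_ports(hosts, ports=(9200, 9300), timeout: float = 3.0):
    """Map each (host, port) pair to the result of can_connect (hypothetical helper)."""
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}
```

For example, `check_ports(["10.0.2.3", "10.0.2.4"])` would probe the HTTP and transport ports of two of the nodes. A successful connect only shows that the port is reachable; as the logs above show, the join request can still time out.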

dadoonet · December 8, 2017, 12:38pm

Are you using the GCE discovery Plugin?

I believe you are not, but I prefer to ask.

What are your config files for both nodes?

zozo6015 · December 8, 2017, 3:24pm

No, I am not using the GCE discovery plugin. The config is the same on all 4 data nodes.

cluster.name: CLUSTERNAME
node.name: ${HOSTNAME}
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["10.0.2.3", "10.0.2.4", "10.0.2.5", "10.0.2.6", "10.0.2.9", "10.0.2.2", "10.0.2.10"]
discovery.zen.minimum_master_nodes: 2
path.data: /var/lib/elasticsearch/data-1, /var/lib/elasticsearch/data-2, /var/lib/elasticsearch/data-3, /var/lib/elasticsearch/data-4, /var/lib/elasticsearch/data-5, /var/lib/elasticsearch/data-6, /var/lib/elasticsearch/data-7, /var/lib/elasticsearch/data-8
node.master: false
node.data: true
node.ingest: false

dadoonet · December 8, 2017, 6:25pm

What does GET _cat/nodes?v return?

zozo6015 · December 8, 2017, 6:48pm

Nothing, it just hangs.
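A request that hangs like this can be bounded with a client-side timeout, so that a missing answer is reported instead of blocking forever. This is a minimal Python sketch; the helper names are hypothetical, and the base URL in the usage note is an assumption taken from the publish_address in the logs.

```python
import urllib.request


def build_cat_nodes_url(base_url: str) -> str:
    """Build the URL for the _cat/nodes?v request against a node's HTTP endpoint."""
    return base_url.rstrip("/") + "/_cat/nodes?v"


def fetch_text(url: str, timeout: float = 5.0):
    """Return the response body as text, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Calling `fetch_text(build_cat_nodes_url("http://10.0.2.10:9200"))` would return the node table, or `None` after roughly five seconds if the node cannot answer.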

dadoonet · December 8, 2017, 7:48pm

Uh?

So please describe your configuration exactly. How many nodes do you have in total? Which are the master nodes? Are they started? If so, please share the logs.

zozo6015 · December 9, 2017, 11:19am

It turned out that the new node had some disk issues in its LVM volume, and that made Elasticsearch behave erratically. We changed the path.data strategy from using LVM to using multiple data paths. That made the whole thing stable again, and it has been working for almost 2 days.

Thank you for your time.

zozo6015 · December 12, 2017, 6:26am

It happened again when a node fell out of the cluster because of a sync failure. https://pastebin.com/m8P5LJNF

This time I can see some input/output errors, but I don't think they are disk related. After resetting the node, it never managed to get in touch with the Elasticsearch cluster again.

Regards,

zozo6015 · December 12, 2017, 6:46am

To answer your previous questions: we have 4 data nodes and 3 master nodes. All master nodes are up and running and have never had any issues. All the master nodes and 3 of the data nodes use the default-jvm package that came with Ubuntu; these nodes have had no issues at all. We tried one data node with the oracle-java9-installer package, and this node is the only one that gave us problems.

Update: I have reinstalled the problematic node using the same Java package as the other servers, and now it cannot join the cluster at all:

[2017-12-12T07:22:06,921][WARN ][o.e.n.Node ] [elastic-data-4] timed out while waiting for initial discovery state - timeout: 30s
[2017-12-12T07:22:06,928][INFO ][o.e.h.n.Netty4HttpServerTransport] [elastic-data-4] publish_address {10.0.2.10:9200}, bound_addresses {[::]:9200}
[2017-12-12T07:22:06,928][INFO ][o.e.n.Node ] [elastic-data-4] started
[2017-12-12T07:22:39,949][INFO ][o.e.d.z.ZenDiscovery ] [elastic-data-4] failed to send join request to master [{elastic-master-1}{gKj54t0yQ5eY6OXhUgU6RQ}{JFRliz5_Q-2pOechraGKrg}{10.0.2.4}{10.0.2.4:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]

dadoonet · December 12, 2017, 8:09am

"We tried one data node with oracle-java9-installer package and this node is the only one which gave us problems."

Java 9 is not supported for now; see the Elastic Support Matrix.

I can also see many GC warnings in your logs. It sounds like your nodes are under memory pressure?

zozo6015 · December 12, 2017, 8:12am

I have downgraded Java to 8, and I am getting timeouts on discovery and on joining the masters, as can be seen above. I am not sure what has changed, but this is becoming a complete headache.

dadoonet · December 12, 2017, 8:26am

Can you share the logs of the current master node?

zozo6015 · December 12, 2017, 9:44am

It seems that the masters were using GCE discovery, which had failed, and since the cluster needs two master nodes, the cluster failed as well.
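The "needs two master nodes" requirement follows from discovery.zen.minimum_master_nodes: 2 with 3 master-eligible nodes, which matches the usual quorum rule of (master-eligible nodes / 2) + 1. A small Python sketch of that arithmetic, with hypothetical function names:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(master_eligible: int, discovered: int) -> bool:
    """True if enough master-eligible nodes can see each other to elect a master."""
    return discovered >= minimum_master_nodes(master_eligible)
```

With 3 master-eligible nodes the quorum is 2, so if failed discovery leaves each master seeing only itself, `can_elect_master(3, 1)` is false and no master is elected.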

dadoonet · December 12, 2017, 5:03pm

So do you know how to fix it, or do you want to share your logs?

zozo6015 · December 12, 2017, 5:17pm

I changed the configuration of the master nodes to use unicast discovery instead of GCE discovery, which has been failing quite often lately for no apparent reason, and everything went back to normal.

system · January 9, 2018, 5:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
