Elasticsearch data node fails to join the cluster

Hello

I have an Elasticsearch 6.0.0 cluster running on GCE, and I want to add a data node, but this node is failing to join the cluster.

I am getting the following in the logs:

[2017-12-07T15:50:42,782][WARN ][o.e.n.Node               ] [elastic-data-4] timed out while waiting for initial discovery state - timeout: 30s
[2017-12-07T15:50:42,807][INFO ][o.e.h.n.Netty4HttpServerTransport] [elastic-data-4] publish_address {10.0.2.10:9200}, bound_addresses {[::]:9200}
[2017-12-07T15:50:42,807][INFO ][o.e.n.Node               ] [elastic-data-4] started
[2017-12-07T15:51:15,953][INFO ][o.e.d.z.ZenDiscovery     ] [elastic-data-4] failed to send join request to master [{elastic-master-2}{3MYOSUphRnSa0dGIB8cjFQ}{lfaoDLVQSlKRZPUumEYclA}{10.0.2.3}{10.0.2.3:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]

The node can curl every other node on ports 9200 and 9300, so it is not a network issue.
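
The reachability check described above can be repeated for every host in one go. This is only a minimal sketch: the helper names `port_is_open` and `check_hosts` are made up for illustration, and the host list would be whatever is in discovery.zen.ping.unicast.hosts. Note that a successful TCP connect only shows that the port accepts connections; it does not prove that a join request will complete.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_hosts(hosts, ports=(9200, 9300), timeout=3.0):
    """Map every (host, port) pair to True/False depending on reachability."""
    return {(h, p): port_is_open(h, p, timeout) for h in hosts for p in ports}
```

For example, `check_hosts(["10.0.2.3", "10.0.2.4"])` returns a dictionary keyed by (host, port), so any pair that maps to False is worth a closer look.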

Any suggestions would be helpful.

Are you using the GCE discovery plugin?

I believe you are not, but I prefer to ask.

What are your config files for both nodes?

No, I am not using the GCE discovery plugin. The config is the same on all 4 data nodes.

cluster.name: CLUSTERNAME
node.name: ${HOSTNAME}
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["10.0.2.3", "10.0.2.4", "10.0.2.5", "10.0.2.6", "10.0.2.9", "10.0.2.2", "10.0.2.10"]
discovery.zen.minimum_master_nodes: 2
path.data: /var/lib/elasticsearch/data-1, /var/lib/elasticsearch/data-2, /var/lib/elasticsearch/data-3, /var/lib/elasticsearch/data-4, /var/lib/elasticsearch/data-5, /var/lib/elasticsearch/data-6, /var/lib/elasticsearch/data-7, /var/lib/elasticsearch/data-8
node.master: false
node.data: true
node.ingest: false

What does GET _cat/nodes?v return?

Nothing, it just hangs.
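
When the request hangs like this, it can help to run it with a client-side timeout so the check fails fast instead of blocking. A minimal sketch using only the Python standard library; the function names `cat_nodes_url` and `fetch_cat_nodes` are hypothetical, and the base URL is assumed to be the node's HTTP address from the logs (for example http://10.0.2.10:9200):

```python
import urllib.request


def cat_nodes_url(base_url):
    """Build the _cat/nodes?v URL for a node's HTTP address."""
    return base_url.rstrip("/") + "/_cat/nodes?v"


def fetch_cat_nodes(base_url, timeout=5.0):
    """Return the _cat/nodes output as text, or None on error or timeout."""
    try:
        with urllib.request.urlopen(cat_nodes_url(base_url), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

A None result from a node that answers on port 9200 would suggest the node is up but not able to serve cluster-level requests, which fits a node that has not joined a master.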

Uh?

So please describe exactly your configuration. How many nodes do you have in total? Which are the master nodes? Are they started? If so share the logs please.

It turned out that the new node had some disk issues in its LVM setup, and that made Elasticsearch act crazy. We changed the path.data strategy from using LVM to multiple paths, which made the whole thing stable again; it has been working for almost 2 days.

Thank you for your time.

It happened again when a node fell out of the cluster because of some sync failure. https://pastebin.com/m8P5LJNF

This time I can see some input/output errors, but I don't think it is disk related. After resetting the node, it never managed to get in touch with the Elasticsearch cluster again.

Regards,

To answer your previous questions: we have 4 data nodes and 3 master nodes. All master nodes are up and running and have never had any issues. All the master nodes and 3 of the data nodes use the default-jvm package that came with Ubuntu, and these nodes have had no issues at all. We tried one data node with oracle-java9-installer package and this node is the only one which gave us problems.

Update: I have reinstalled the problematic node using the same Java package as the other servers, and now it cannot join the cluster at all:

[2017-12-12T07:22:06,921][WARN ][o.e.n.Node ] [elastic-data-4] timed out while waiting for initial discovery state - timeout: 30s
[2017-12-12T07:22:06,928][INFO ][o.e.h.n.Netty4HttpServerTransport] [elastic-data-4] publish_address {10.0.2.10:9200}, bound_addresses {[::]:9200}
[2017-12-12T07:22:06,928][INFO ][o.e.n.Node ] [elastic-data-4] started
[2017-12-12T07:22:39,949][INFO ][o.e.d.z.ZenDiscovery ] [elastic-data-4] failed to send join request to master [{elastic-master-1}{gKj54t0yQ5eY6OXhUgU6RQ}{JFRliz5_Q-2pOechraGKrg}{10.0.2.4}{10.0.2.4:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]

We tried one data node with oracle-java9-installer package and this node is the only one which gave us problems.

Java 9 is not supported for now. See the Elastic Support Matrix.

I can also see in your logs many GC warnings. It sounds like your nodes are under memory pressure?

I have downgraded Java to 8 and I am getting timeouts on discovery and on joining the masters, as can be seen above. Not sure what has changed, but this is becoming a complete headache.

Can you share the logs of the current master node?

It seems that the masters were using GCE discovery, which failed, and since the cluster needs two master nodes, it failed.
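
For context, the "two master nodes" requirement matches the usual quorum rule for Zen discovery, where discovery.zen.minimum_master_nodes is set to (number of master-eligible nodes / 2) + 1. A small sketch of that arithmetic; the function name is only for illustration:

```python
def minimum_master_nodes(master_eligible):
    """Quorum for Zen discovery: (master-eligible nodes / 2) + 1, rounded down."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1
```

With the 3 master nodes in this cluster, `minimum_master_nodes(3)` gives 2, which is the value in the posted config. If the masters cannot discover each other, no quorum forms, no master is elected, and data nodes have nothing to join.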

So do you know how to fix it, or do you want to share your logs?

I changed the configuration of the master nodes to use unicast discovery instead of GCE discovery, which has been failing quite often lately for no apparent reason, and everything went back to normal.
