So please describe exactly your configuration. How many nodes do you have in total? Which are the master nodes? Are they started? If so share the logs please.
It turned out that the new node had some disk issues in the LVM setup, and that made Elasticsearch act crazy. We changed the path.data strategy from using LVM to multiple paths, which made the whole thing stable again; it has been working for almost 2 days.
This time I can see some input/output errors, but I don't think they are disk related. After resetting the node, it never managed to rejoin the Elasticsearch cluster.
To answer your previous questions: we have 4 data nodes and 3 master nodes. All master nodes are up and running and they never had any issues. All the master nodes and 3 of the data nodes use the default-jvm package that came with Ubuntu. These nodes had no issues at all. We tried one data node with the oracle-java9-installer package, and this node is the only one that gave us problems.
Update: I have reinstalled the problematic node using the same Java package as the other servers, and now it cannot join the cluster at all:
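For readers unfamiliar with the setting, here is a minimal sketch of what a multi-path data configuration in elasticsearch.yml can look like. The mount points are hypothetical placeholders, not the ones used on this cluster.

```yaml
# elasticsearch.yml (sketch; both mount points are hypothetical placeholders)
# Each entry is a separately mounted disk. Every shard is stored entirely
# on one of the listed paths; a shard's files are not striped across paths.
path.data:
  - /mnt/disk1/elasticsearch
  - /mnt/disk2/elasticsearch
```

Unlike an LVM volume spanning several disks, this keeps each disk as its own filesystem, so a problem on one disk should not affect the data stored on the others.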
[2017-12-12T07:22:06,921][WARN ][o.e.n.Node ] [elastic-data-4] timed out while waiting for initial discovery state - timeout: 30s
[2017-12-12T07:22:06,928][INFO ][o.e.h.n.Netty4HttpServerTransport] [elastic-data-4] publish_address {10.0.2.10:9200}, bound_addresses {[::]:9200}
[2017-12-12T07:22:06,928][INFO ][o.e.n.Node ] [elastic-data-4] started
[2017-12-12T07:22:39,949][INFO ][o.e.d.z.ZenDiscovery ] [elastic-data-4] failed to send join request to master [{elastic-master-1}{gKj54t0yQ5eY6OXhUgU6RQ}{JFRliz5_Q-2pOechraGKrg}{10.0.2.4}{10.0.2.4:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
I have downgraded Java to 8 and I am getting timeouts on discovery and when joining the masters, as can be seen above. Not sure what has changed, but this is becoming a complete headache.
I changed the configuration of the master nodes to use unicast discovery instead of GCE discovery, which has been failing quite often lately for no apparent reason, and everything went back to normal.
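As an illustration of that change, a unicast setup in elasticsearch.yml might look roughly like the sketch below. It assumes a 5.x/6.x release using zen discovery (which the ZenDiscovery log lines suggest). Only 10.0.2.4 comes from the log above; the other two master addresses are hypothetical.

```yaml
# elasticsearch.yml (sketch, assuming Elasticsearch 5.x/6.x zen discovery)
# 10.0.2.4 is elastic-master-1 as seen in the log; the other two
# addresses are hypothetical stand-ins for the remaining master nodes.
discovery.zen.ping.unicast.hosts:
  - 10.0.2.4:9300
  - 10.0.2.5:9300   # hypothetical
  - 10.0.2.6:9300   # hypothetical

# Quorum for 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

With a static host list, the nodes contact the masters directly on the transport port instead of looking them up through the GCE API, so a failing lookup can no longer block discovery. Any GCE-specific discovery setting that was in place would need to be removed for this to take effect.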