[Elasticsearch 7.1.1] Node failed to join cluster with GCE discovery plugin due to different cluster uuid

Hey there, I am currently building an automated Elasticsearch cluster deployment on Google Cloud Platform with Terraform and Packer. For a 2-node test deployment, the config files (elasticsearch.yml) are as below:

node-1 (private IP = 10.15.0.80)
path.logs: /elasticsearch/logs
cluster.name: testing-es-cluster-gce
discovery.zen.minimum_master_nodes: 1
cluster.initial_master_nodes: [node-10.15.0.80]
cloud.gce.project_id: dev-cortex-global-01
discovery.seed_providers: gce
http.port: 9200
node.master: true
path.data: /elasticsearch/data
network.host: 10.15.0.80
node.name: node-10.15.0.80
cloud.gce.zone: us-central1-a

and this is node 2's elasticsearch.yml

path.logs: /elasticsearch/logs
cluster.name: testing-es-cluster-gce
discovery.zen.minimum_master_nodes: 1
cluster.initial_master_nodes: [node-10.15.0.95]
cloud.gce.project_id: dev-cortex-global-01
discovery.seed_providers: gce
http.port: 9200
node.master: true
path.data: /elasticsearch/data
network.host: 10.15.0.95
node.name: node-10.15.0.95
cloud.gce.zone: us-central1-a

The above two nodes (VM instances) are deployed in the same GCP project and zone, and ports 9200 and 9300 are not blocked on either of them. However, when I start Elasticsearch (which I configured as a systemd service) with systemctl start elasticsearch on both instances, neither is able to discover the other. The log from node 1 is as follows:

...
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,399][INFO ][o.e.c.g.GceInstancesServiceImpl] [node-10.15.0.95] starting GCE discovery service
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,648][INFO ][o.e.c.s.MasterService ] [node-10.15.0.95] elected-as-master ([1] nodes joined)[{node-10.15.0.95}{uu1BhjLbSQOS3JLUbQprQQ}{cUeLygwyRK688mApErNa1Q}{10.15.0
.95}{10.15.0.95:9300}{ml.machine_memory=7835996160, xpack.installed=true, ml.max_open_jobs=20} elect leader, BECOME_MASTER_TASK, FINISH_ELECTION], term: 1, version: 1, reason: master node changed {previous , current [{node-10.15.0.95}{uu1Bh
jLbSQOS3JLUbQprQQ}{cUeLygwyRK688mApErNa1Q}{10.15.0.95}{10.15.0.95:9300}{ml.machine_memory=7835996160, xpack.installed=true, ml.max_open_jobs=20}]}
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,744][INFO ][o.e.c.c.CoordinationState] [node-10.15.0.95] cluster UUID set to [7BBzJqO4QDKsCpYDpfJ1RQ]
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,793][INFO ][o.e.c.s.ClusterApplierService] [node-10.15.0.95] master node changed {previous , current [{node-10.15.0.95}{uu1BhjLbSQOS3JLUbQprQQ}{cUeLygwyRK688mApErNa1
Q}{10.15.0.95}{10.15.0.95:9300}{ml.machine_memory=7835996160, xpack.installed=true, ml.max_open_jobs=20}]}, term: 1, version: 1, reason: Publication{term=1, version=1}
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,039][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [node-10.15.0.95] Failed to clear cache for realms []
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,049][INFO ][o.e.h.AbstractHttpServerTransport] [node-10.15.0.95] publish_address {10.15.0.95:9200}, bound_addresses {10.15.0.95:9200}
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,051][INFO ][o.e.n.Node ] [node-10.15.0.95] started
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,428][INFO ][o.e.g.GatewayService ] [node-10.15.0.95] recovered [0] indices into cluster_state

and node 2's log:

...
Jul 19 01:03:43 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:43,813][INFO ][o.e.c.g.GceInstancesServiceImpl] [node-10.15.0.80] starting GCE discovery service
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,155][INFO ][o.e.c.s.MasterService ] [node-10.15.0.80] elected-as-master ([1] nodes joined)[{node-10.15.0.80}{xRijgHF8SXWh36jZ9OVmZg}{YeUdSQ5RQWybnN-dPRD4Xg}{10.15.0
.80}{10.15.0.80:9300}{ml.machine_memory=7836004352, xpack.installed=true, ml.max_open_jobs=20} elect leader, BECOME_MASTER_TASK, FINISH_ELECTION], term: 1, version: 1, reason: master node changed {previous , current [{node-10.15.0.80}{xRijg
HF8SXWh36jZ9OVmZg}{YeUdSQ5RQWybnN-dPRD4Xg}{10.15.0.80}{10.15.0.80:9300}{ml.machine_memory=7836004352, xpack.installed=true, ml.max_open_jobs=20}]}
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,234][INFO ][o.e.c.c.CoordinationState] [node-10.15.0.80] cluster UUID set to [F2q29Cc9So-3Wt-hJm8R1w]
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,260][INFO ][o.e.c.s.ClusterApplierService] [node-10.15.0.80] master node changed {previous , current [{node-10.15.0.80}{xRijgHF8SXWh36jZ9OVmZg}{YeUdSQ5RQWybnN-dPRD4X
g}{10.15.0.80}{10.15.0.80:9300}{ml.machine_memory=7836004352, xpack.installed=true, ml.max_open_jobs=20}]}, term: 1, version: 1, reason: Publication{term=1, version=1}
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,449][INFO ][o.e.h.AbstractHttpServerTransport] [node-10.15.0.80] publish_address {10.15.0.80:9200}, bound_addresses {10.15.0.80:9200}
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,451][INFO ][o.e.n.Node ] [node-10.15.0.80] started
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,454][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [node-10.15.0.80] Failed to clear cache for realms []
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,660][INFO ][o.e.g.GatewayService ] [node-10.15.0.80] recovered [0] indices into cluster_state

After a little research I discovered that it might be the differing cluster UUIDs that are preventing the nodes from joining one cluster. Yet cluster.name is set to the same value on both nodes (via another systemd service that has a Before= dependency on the elasticsearch service, so elasticsearch.yml is always written before Elasticsearch starts on each node). What could cause this inconsistency in cluster UUID (which I suspect is preventing the join), and what can I do to fix it? Thanks!!!
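One way to confirm the split is that each node's HTTP root endpoint reports its cluster_uuid, so comparing the two values shows whether the nodes formed one cluster or two. A minimal sketch: the sample responses below are trimmed-down stand-ins using the UUIDs from the logs above; in practice you would capture each response with curl http://NODE_IP:9200/ instead.

```shell
# Stand-in responses; in a live check these would come from
# curl -s http://10.15.0.95:9200/ and curl -s http://10.15.0.80:9200/
resp1='{"name":"node-10.15.0.95","cluster_name":"testing-es-cluster-gce","cluster_uuid":"7BBzJqO4QDKsCpYDpfJ1RQ"}'
resp2='{"name":"node-10.15.0.80","cluster_name":"testing-es-cluster-gce","cluster_uuid":"F2q29Cc9So-3Wt-hJm8R1w"}'

# Pull the cluster_uuid field out of a root-endpoint JSON response.
uuid() { echo "$1" | sed -n 's/.*"cluster_uuid":"\([^"]*\)".*/\1/p'; }

if [ "$(uuid "$resp1")" = "$(uuid "$resp2")" ]; then
  echo "same cluster"
else
  echo "split: the nodes bootstrapped two separate clusters"
fi
```

With the log output above, this prints the "split" branch, since the two UUIDs differ.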

You have instructed the nodes to form separate one-node clusters: each node lists only itself in cluster.initial_master_nodes, so each one bootstraps its own cluster with its own UUID.

From the docs:

WARNING: You must set cluster.initial_master_nodes to the same list of nodes on each node on which it is set in order to be sure that only a single cluster forms during bootstrapping and therefore to avoid the risk of data loss.
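Concretely, for the two nodes in this thread, a corrected config would carry the same bootstrap list on both nodes (a sketch; the per-node settings such as node.name and network.host still differ elsewhere in the file):

```yaml
# Identical on BOTH nodes:
cluster.name: testing-es-cluster-gce
discovery.seed_providers: gce
cluster.initial_master_nodes:
  - node-10.15.0.80
  - node-10.15.0.95
```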


Also, discovery.zen.minimum_master_nodes is deprecated and does nothing in a new v7 cluster. If this were a v6-or-earlier cluster then setting it to 1 like this would lead to data loss. I suggest you remove this line entirely.

Hi David, thank you so much for your reply! I think your suggestion would solve the issue. However, I am doing fully automated builds on GCP: every node comes from the same Packer machine image, so there is no way to assign each node a unique name before booting it (another provisioning script writes elasticsearch.yml and names the node after its private IP address the first time the instance starts). Is there a workaround that does not require specifying cluster.initial_master_nodes in elasticsearch.yml? (Similarly, discovery.seed_hosts wouldn't work, since the node IPs are not known to me beforehand.) Thanks!

No, cluster.initial_master_nodes is required on at least one node, and must be set consistently on all nodes on which it is set. I don't think that GCP offers an API that could be used for this, so the gce-discovery plugin cannot really help to auto-discover a suitable value. Note that it's only required when provisioning a brand-new cluster and should be removed from the config once the cluster is up and running. I think the simplest solution is to fix the names of the first few master-eligible nodes up front rather than generating them dynamically like you are doing.
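For example (hypothetical names, as a sketch of the fixed-names approach), the provisioning script could assign predetermined names to the first three master-eligible instances and reference only those:

```yaml
# On each of the three master-eligible instances, with node.name set to
# master-0, master-1, or master-2 respectively at provisioning time:
node.master: true
cluster.initial_master_nodes:
  - master-0
  - master-1
  - master-2
```

Nodes provisioned later, with dynamically generated names, then join via GCE discovery without needing this setting.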

Thanks David! I think I will just implement another service that writes the full cluster.initial_master_nodes list to elasticsearch.yml. Can a multi-node cluster be bootstrapped when only one master-eligible node has cluster.initial_master_nodes set (the others will not have this entry in their .yml config)? I am trying to minimize the number of HTTP requests I have to make, since for a large cluster writing to every node could take a long time. Thanks!

Yes, although if that node fails before the cluster has fully formed then you may have to start again from scratch. However ...

... this setting should only be set on master-eligible nodes and you should not have very many of them. It has no effect on nodes with node.master: false.
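A sketch of that single-bootstrap-node pattern, again with hypothetical names:

```yaml
# elasticsearch.yml on master-0 ONLY -- the single bootstrap node:
node.name: master-0
node.master: true
cluster.initial_master_nodes: [master-0]

# All other nodes omit cluster.initial_master_nodes entirely and rely on
# discovery.seed_providers: gce to find master-0 and join its cluster.
```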

Thanks David, the problem is now resolved!
