[Elasticsearch 7.1.1] Node failed to join cluster with GCE discovery plugin due to different cluster uuid

Hey there, I am currently building an automated Elasticsearch cluster deployment on Google Cloud Platform with Terraform and Packer. For a 2-node test deployment, the config files (elasticsearch.yml) are as below:

node-1 (private IP = 10.15.0.80)
path.logs: /elasticsearch/logs
cluster.name: testing-es-cluster-gce
discovery.zen.minimum_master_nodes: 1
cluster.initial_master_nodes: [node-10.15.0.80]
cloud.gce.project_id: dev-cortex-global-01
discovery.seed_providers: gce
http.port: 9200
node.master: true
path.data: /elasticsearch/data
network.host: 10.15.0.80
node.name: node-10.15.0.80
cloud.gce.zone: us-central1-a

and this is node 2's elasticsearch.yml

path.logs: /elasticsearch/logs
cluster.name: testing-es-cluster-gce
discovery.zen.minimum_master_nodes: 1
cluster.initial_master_nodes: [node-10.15.0.95]
cloud.gce.project_id: dev-cortex-global-01
discovery.seed_providers: gce
http.port: 9200
node.master: true
path.data: /elasticsearch/data
network.host: 10.15.0.95
node.name: node-10.15.0.95
cloud.gce.zone: us-central1-a

The above two nodes (VM instances) are deployed in the same GCP project and zone, and ports 9200 and 9300 are not blocked on either of them. However, when I start Elasticsearch (which I configured as a systemd service) with systemctl start elasticsearch on both instances, neither is able to discover the other. The log from node 1 is as follows:

...
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,399][INFO ][o.e.c.g.GceInstancesServiceImpl] [node-10.15.0.95] starting GCE discovery service
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,648][INFO ][o.e.c.s.MasterService ] [node-10.15.0.95] elected-as-master ([1] nodes joined)[{node-10.15.0.95}{uu1BhjLbSQOS3JLUbQprQQ}{cUeLygwyRK688mApErNa1Q}{10.15.0
.95}{10.15.0.95:9300}{ml.machine_memory=7835996160, xpack.installed=true, ml.max_open_jobs=20} elect leader, BECOME_MASTER_TASK, FINISH_ELECTION], term: 1, version: 1, reason: master node changed {previous , current [{node-10.15.0.95}{uu1Bh
jLbSQOS3JLUbQprQQ}{cUeLygwyRK688mApErNa1Q}{10.15.0.95}{10.15.0.95:9300}{ml.machine_memory=7835996160, xpack.installed=true, ml.max_open_jobs=20}]}
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,744][INFO ][o.e.c.c.CoordinationState] [node-10.15.0.95] cluster UUID set to [7BBzJqO4QDKsCpYDpfJ1RQ]
Jul 19 01:03:29 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:29,793][INFO ][o.e.c.s.ClusterApplierService] [node-10.15.0.95] master node changed {previous , current [{node-10.15.0.95}{uu1BhjLbSQOS3JLUbQprQQ}{cUeLygwyRK688mApErNa1
Q}{10.15.0.95}{10.15.0.95:9300}{ml.machine_memory=7835996160, xpack.installed=true, ml.max_open_jobs=20}]}, term: 1, version: 1, reason: Publication{term=1, version=1}
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,039][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [node-10.15.0.95] Failed to clear cache for realms []
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,049][INFO ][o.e.h.AbstractHttpServerTransport] [node-10.15.0.95] publish_address {10.15.0.95:9200}, bound_addresses {10.15.0.95:9200}
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,051][INFO ][o.e.n.Node ] [node-10.15.0.95] started
Jul 19 01:03:30 haotian-tf-es-node-0 elasticsearch[2766]: [2019-07-19T01:03:30,428][INFO ][o.e.g.GatewayService ] [node-10.15.0.95] recovered [0] indices into cluster_state

and node 2's log:

...
Jul 19 01:03:43 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:43,813][INFO ][o.e.c.g.GceInstancesServiceImpl] [node-10.15.0.80] starting GCE discovery service
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,155][INFO ][o.e.c.s.MasterService ] [node-10.15.0.80] elected-as-master ([1] nodes joined)[{node-10.15.0.80}{xRijgHF8SXWh36jZ9OVmZg}{YeUdSQ5RQWybnN-dPRD4Xg}{10.15.0
.80}{10.15.0.80:9300}{ml.machine_memory=7836004352, xpack.installed=true, ml.max_open_jobs=20} elect leader, BECOME_MASTER_TASK, FINISH_ELECTION], term: 1, version: 1, reason: master node changed {previous , current [{node-10.15.0.80}{xRijg
HF8SXWh36jZ9OVmZg}{YeUdSQ5RQWybnN-dPRD4Xg}{10.15.0.80}{10.15.0.80:9300}{ml.machine_memory=7836004352, xpack.installed=true, ml.max_open_jobs=20}]}
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,234][INFO ][o.e.c.c.CoordinationState] [node-10.15.0.80] cluster UUID set to [F2q29Cc9So-3Wt-hJm8R1w]
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,260][INFO ][o.e.c.s.ClusterApplierService] [node-10.15.0.80] master node changed {previous , current [{node-10.15.0.80}{xRijgHF8SXWh36jZ9OVmZg}{YeUdSQ5RQWybnN-dPRD4X
g}{10.15.0.80}{10.15.0.80:9300}{ml.machine_memory=7836004352, xpack.installed=true, ml.max_open_jobs=20}]}, term: 1, version: 1, reason: Publication{term=1, version=1}
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,449][INFO ][o.e.h.AbstractHttpServerTransport] [node-10.15.0.80] publish_address {10.15.0.80:9200}, bound_addresses {10.15.0.80:9200}
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,451][INFO ][o.e.n.Node ] [node-10.15.0.80] started
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,454][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [node-10.15.0.80] Failed to clear cache for realms []
Jul 19 01:03:44 haotian-tf-es-node-1 elasticsearch[2998]: [2019-07-19T01:03:44,660][INFO ][o.e.g.GatewayService ] [node-10.15.0.80] recovered [0] indices into cluster_state

After a little research I discovered that it might be the differing cluster UUIDs that are preventing the nodes from joining one cluster. Yet cluster.name is set to the same value on both nodes (via another systemd service that has a Before= dependency on the elasticsearch service, so elasticsearch.yml is always written before Elasticsearch starts on each node). What could cause this inconsistency in cluster UUID (which I suspect is preventing the join), and what can I do to fix it? Thanks!!!
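One way to confirm the split is that each node's HTTP root endpoint reports its cluster_uuid, so comparing the two values shows whether the nodes formed one cluster or two. A minimal sketch: the sample responses below are trimmed-down stand-ins using the UUIDs from the logs above; in practice you would capture each response with curl http://NODE_IP:9200/ instead.

```shell
# Stand-in responses; in a live check these would come from
# curl -s http://10.15.0.95:9200/ and curl -s http://10.15.0.80:9200/
resp1='{"name":"node-10.15.0.95","cluster_name":"testing-es-cluster-gce","cluster_uuid":"7BBzJqO4QDKsCpYDpfJ1RQ"}'
resp2='{"name":"node-10.15.0.80","cluster_name":"testing-es-cluster-gce","cluster_uuid":"F2q29Cc9So-3Wt-hJm8R1w"}'

# Pull the cluster_uuid field out of a root-endpoint JSON response.
uuid() { echo "$1" | sed -n 's/.*"cluster_uuid":"\([^"]*\)".*/\1/p'; }

if [ "$(uuid "$resp1")" = "$(uuid "$resp2")" ]; then
  echo "same cluster"
else
  echo "split: the nodes bootstrapped two separate clusters"
fi
```

With the log output above, this prints the "split" branch, since the two UUIDs differ.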

You have instructed the nodes to form separate one-node clusters: each node lists only itself in cluster.initial_master_nodes, so each one bootstraps its own cluster with its own UUID.

From the docs:

WARNING: You must set cluster.initial_master_nodes to the same list of nodes on each node on which it is set in order to be sure that only a single cluster forms during bootstrapping and therefore to avoid the risk of data loss.
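Concretely, for the two nodes in this thread, a corrected config would carry the same bootstrap list on both nodes (a sketch; the per-node settings such as node.name and network.host still differ elsewhere in the file):

```yaml
# Identical on BOTH nodes:
cluster.name: testing-es-cluster-gce
discovery.seed_providers: gce
cluster.initial_master_nodes:
  - node-10.15.0.80
  - node-10.15.0.95
```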


Also, discovery.zen.minimum_master_nodes is deprecated and does nothing in a new v7 cluster. If this were a v6-or-earlier cluster then setting it to 1 like this would lead to data loss. I suggest you remove this line entirely.

Hi David, thank you so much for your reply! I think your suggestion would solve the issue. However, I am doing fully automated builds on GCP: every node comes from the same Packer machine image, so there is no way to assign each node a unique name before booting it (another provisioning script writes elasticsearch.yml and names the node after its private IP address the first time the instance starts). Is there a workaround that does not require specifying cluster.initial_master_nodes in elasticsearch.yml? (Similarly, discovery.seed_hosts wouldn't work, since the node IPs are not known to me beforehand.) Thanks!

No, cluster.initial_master_nodes is required on at least one node, and must be set consistently on all nodes on which it is set. I don't think that GCP offers an API that could be used for this, so the gce-discovery plugin cannot really help to auto-discover a suitable value. Note that it's only required when provisioning a brand-new cluster and should be removed from the config once the cluster is up and running. I think the simplest solution is to fix the names of the first few master-eligible nodes up front rather than generating them dynamically like you are doing.
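For example (hypothetical names, as a sketch of the fixed-names approach), the provisioning script could assign predetermined names to the first three master-eligible instances and reference only those:

```yaml
# On each of the three master-eligible instances, with node.name set to
# master-0, master-1, or master-2 respectively at provisioning time:
node.master: true
cluster.initial_master_nodes:
  - master-0
  - master-1
  - master-2
```

Nodes provisioned later, with dynamically generated names, then join via GCE discovery without needing this setting.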

Thanks David! I think I will just implement another service that writes the full cluster.initial_master_nodes list to elasticsearch.yml. Can a multi-node cluster be bootstrapped when only one master-eligible node has cluster.initial_master_nodes set (the others will not have this entry in their .yml config)? I am trying to minimize the number of HTTP requests I have to make, since for a large cluster writing to every node could take a long time. Thanks!

Yes, although if that node fails before the cluster has fully formed then you may have to start again from scratch. However ...

... this setting should only be set on master-eligible nodes and you should not have very many of them. It has no effect on nodes with node.master: false.
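A sketch of that single-bootstrap-node pattern, again with hypothetical names:

```yaml
# elasticsearch.yml on master-0 ONLY -- the single bootstrap node:
node.name: master-0
node.master: true
cluster.initial_master_nodes: [master-0]

# All other nodes omit cluster.initial_master_nodes entirely and rely on
# discovery.seed_providers: gce to find master-0 and join its cluster.
```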

Thanks David, the problem is now resolved!
