Unable to form cluster in GCP with discovery-gce plugin

I am trying to set up a 3-node ES cluster in GCP across 3 VM instances. I have created the 3 VMs and am trying to set up ES on each using docker-compose, following the official ES documentation. Yet I am unable to form a cluster. The error I get is:

elasticsearch01 | {"type": "server", "timestamp": "2020-02-07T08:22:38,765Z", "level": "DEBUG", "component": "o.e.a.s.m.TransportMasterNodeAction", "cluster.name": "elk-docker-cluster", "node.name": "elasticsearch01", "message": "no known master node, scheduling a retry" }
elasticsearch01 | {"type": "server", "timestamp": "2020-02-07T08:22:38,766Z", "level": "DEBUG", "component": "o.e.a.s.m.TransportMasterNodeAction", "cluster.name": "elk-docker-cluster", "node.name": "elasticsearch01", "message": "no known master node, scheduling a retry" }
elasticsearch01 | {"type": "server", "timestamp": "2020-02-07T08:22:40,265Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elk-docker-cluster", "node.name": "elasticsearch01", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [WGoBC5OOTZa69SmGH7KRwg, P11WUJLoSHqc22Uy2MIhVA, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{H4BFylE-TlWSdNE1oeCz7A}{172.19.0.2}{172.19.0.2:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.203.116.114:9300, 10.203.116.15:9300, 10.203.116.108:9300, 10.203.116.16:9300, 10.203.119.249:9300, 10.203.116.8:9300, 10.203.116.38:9300, 10.203.116.126:9300, 10.203.116.35:9300, 10.203.116.59:9300, 10.203.116.31:9300, 10.203.116.54:9300, 10.203.116.50:9300, 10.203.116.52:9300, 10.203.116.51:9300, 10.203.116.53:9300, 10.203.116.55:9300, 10.203.116.56:9300, 10.203.116.37:9300, 10.203.116.34:9300, 10.203.116.61:9300, 10.203.116.42:9300] from hosts providers and [{elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{H4BFylE-TlWSdNE1oeCz7A}{172.19.0.2}{172.19.0.2:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 2, last-accepted version 37 in term 2" }

My docker-compose.yaml file is:

version: '2.2'
services:
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.2
    container_name: es03
    command: >
      /bin/sh -c "./bin/elasticsearch-plugin list | grep -q discovery-gce
      || ./bin/elasticsearch-plugin install --batch discovery-gce;
      /usr/local/bin/docker-entrypoint.sh"
    environment:
      - node.name=es03
      - cluster.name=elk-docker-cluster
      - cluster.initial_master_nodes=10.203.116.108,10.203.116.114,10.203.116.15
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms16g -Xmx16g"
      - cloud.gce.project_id=mobile-ci-infra
      - cloud.gce.zone=asia-east1-a
      - discovery.seed_providers=gce
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      - elastic

volumes:
  data01:
    driver: local
networks:
  elastic:
    driver: bridge

The response to the API call http://10.203.116.108:9200/ is as follows:

{
  "name" : "elasticsearch03",
  "cluster_name" : "elk-docker-cluster",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.5.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "8bec50e1e0ad29dad5653712cf3bb580cd1afcdf",
    "build_date" : "2020-01-15T12:11:52.313576Z",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

The cluster_uuid is _na_ for all 3 nodes.
Please help; I am at a loss as to what I am doing wrong.

Also, I am able to telnet between the VMs on ports 9200 and 9300, so the ports are not blocked either.

Fundamentally, this node cannot discover any other nodes.

This node reports its address as 172.19.0.2 but discovery is configured using 10.203.116.x addresses. Are you sure this node is accessible to other nodes at 172.19.0.2?
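If it helps, you can confirm the address a node is advertising with the nodes info API; a quick sketch (the filter_path parameter is just there to trim the output down to the publish address):

curl -s 'http://localhost:9200/_nodes/_local/transport?pretty&filter_path=nodes.*.transport.publish_address'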

I also note that you're using a bridge network. The Docker docs indicate that this is not appropriate for clusters that span multiple hosts:

Bridge networks apply to containers running on the same Docker daemon host. For communication among containers running on different Docker daemon hosts, you can either manage routing at the OS level, or you can use an overlay network.
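(As an aside, one common way to avoid this entirely for a one-container-per-VM setup is host networking, so the container shares the VM's network stack and publishes the 10.203.116.x address directly. A minimal, untested sketch against your file:

  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.2
    network_mode: host
    # with host networking, drop the ports: and networks: sections entirely

This is just one option; fixing the published address, as discussed below, works too.)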

This is the Docker-internal IP that is being picked up. Should I be setting the host machine's IP explicitly with network.publish_host?

I don't think you need to touch network.publish_host but you might need to set network.host. Note that you can instruct network.host to bind to a particular interface by setting it to a string such as _eth0_.
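In your compose file that would just be one more environment entry, e.g. (note that _eth0_ inside the container only resolves to the VM's own address if the container shares the host's network, for instance via the host-networking sketch above; on a bridge network eth0 is the container's internal interface):

    environment:
      - network.host=_eth0_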

So I changed the network.host value to _eth0_ and it now picks up the IP of the host itself. However, I am still unable to bring up the cluster. The error is still very similar:

{"type": "server", "timestamp": "2020-02-10T06:14:05,368Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elk-docker-cluster", "node.name": "elasticsearch01", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [WGoBC5OOTZa69SmGH7KRwg, P11WUJLoSHqc22Uy2MIhVA, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{CTMAQ5aVRqyxZrN6PjcY3w}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}, {elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{-hJczZfcQ_Kxe1KYzrvtyw}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.203.116.15:9300, 10.203.116.108:9300, 10.203.116.39:9300, 10.203.116.21:9300, 10.203.116.117:9300, 10.203.116.38:9300, 10.203.116.59:9300, 10.203.116.118:9300, 10.203.116.54:9300, 10.203.116.50:9300, 10.203.116.52:9300, 10.203.116.51:9300, 10.203.116.53:9300, 10.203.116.55:9300, 10.203.116.56:9300, 10.203.116.37:9300, 10.203.116.34:9300, 10.203.116.61:9300, 10.203.116.42:9300] from hosts providers and [{elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{CTMAQ5aVRqyxZrN6PjcY3w}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 2, last-accepted version 37 in term 2" }

I am able to telnet to the other VMs from this VM's container, so the connection should not be a problem. Also, the cluster_uuid is still _na_ for me. Could that be an issue?

It is, yes - this still looks like a discovery problem, but the IP addresses look consistent now so it's not that.

Can you share the precise command you are using for this test?

Could you run curl -vv http://10.203.116.$N:9300/ from within this container (choose $N for one of the other master nodes) and share the full output too?

No, I think that's to be expected while the cluster is still forming.

Also, just to check the obvious thing, is this list of addresses correct? Does it contain the addresses of the other master-eligible nodes?

I was using telnet. The exact command was telnet 10.203.116.xx 9300, and it would report the connection as successful.

The output of curl -vv http://10.203.116.$N:9300/ is:

*   Trying 10.203.116.15...
* TCP_NODELAY set
* Connected to 10.203.116.15 (10.203.116.15) port 9300 (#0)
> GET / HTTP/1.1
> Host: 10.203.116.15:9300
> User-Agent: curl/7.52.1
> Accept: */*
>
* Curl_http_done: called premature == 0
* Connection #0 to host 10.203.116.15 left intact

It does. The first 2 IPs are the other nodes.

Another thing I have noticed: an election requires at least 2 nodes with ids from [WGoBC5OOTZa69SmGH7KRwg, P11WUJLoSHqc22Uy2MIhVA, E0N7QP2QRRWFwggnJqz1HA]. How are these IDs formed? My docker-compose has the following 2 lines on all 3 VMs:

- cluster.initial_master_nodes=10.203.116.108,10.203.116.114,10.203.116.15
- discovery.seed_hosts=10.203.116.108,10.203.116.114,10.203.116.15

They're internal IDs, randomly generated the first time a node starts up. The node we're looking at has internal ID E0N7QP2QRRWFwggnJqz1HA which is in the list, so although it's a good observation I don't think the problem relates to these IDs.
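(Once a cluster has formed you can list these IDs next to the node names with the _cat API, e.g.:

curl -s 'http://localhost:9200/_cat/nodes?v&full_id=true&h=name,id,ip'

but while there is no elected master, the log messages are the place to read them.)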

Hmm, that isn't how a successful connection to Elasticsearch would respond. You should see a valid HTTP response containing the string This is not an HTTP port, whereas here there seems to be no response at all. Is it possible that you're connecting to something other than Elasticsearch?

Oh wait, are you using security (more precisely, is TLS enabled on the transport layer)? If so, could you try curl -vv -k https://10.203.116.15:9300/ instead?

So let me share the responses from all 3 nodes:
Node 1:
an election requires at least 2 nodes with ids from [WGoBC5OOTZa69SmGH7KRwg, P11WUJLoSHqc22Uy2MIhVA, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{9QjOj390Rou-L6x0FPjzZg}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}, {elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{mPwQsb54Q6OgpUjN44geGQ}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}, {elasticsearch03}{XoLLxo3IS7qdFhlrKI5yZQ}{PhKA-IOUQtKCA2U7fPR6wQ}{10.203.116.108}{10.203.116.108:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum;

Node 2:
an election requires 2 nodes with ids [LzgWiJ14RGWkqiyVLU06vw, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{mPwQsb54Q6OgpUjN44geGQ}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}, {elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{9QjOj390Rou-L6x0FPjzZg}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}, {elasticsearch03}{XoLLxo3IS7qdFhlrKI5yZQ}{PhKA-IOUQtKCA2U7fPR6wQ}{10.203.116.108}{10.203.116.108:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum;

Node 3:
an election requires 2 nodes with ids [XoLLxo3IS7qdFhlrKI5yZQ, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch03}{XoLLxo3IS7qdFhlrKI5yZQ}{PhKA-IOUQtKCA2U7fPR6wQ}{10.203.116.108}{10.203.116.108:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}, {elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{9QjOj390Rou-L6x0FPjzZg}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}, {elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{mPwQsb54Q6OgpUjN44geGQ}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum;

Somehow, this looks wrong to me, though you would be able to judge better.

With respect to the curl call, apologies, I had skipped pasting the last line:
* Trying 10.203.116.15...
* TCP_NODELAY set
* Connected to 10.203.116.15 (10.203.116.15) port 9300 (#0)
> GET / HTTP/1.1
> Host: 10.203.116.15:9300
> User-Agent: curl/7.52.1
> Accept: */*
>
* Curl_http_done: called premature == 0
* Connection #0 to host 10.203.116.15 left intact
This is not an HTTP port

Nope, I haven't enabled TLS on the transport layer.

Ah, OK, these messages are quite different from the ones above. This shows that discovery is now fixed, since all nodes have discovered all the other nodes.

You've truncated these messages, and the bit you omitted is important. Can you share them in full?

Sure. By the way, please see the first node's message: it still ends with which is not a quorum.

Logs:
Node 1:
elasticsearch01 | {"type": "server", "timestamp": "2020-02-10T09:04:32,411Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elk-docker-cluster", "node.name": "elasticsearch01", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [WGoBC5OOTZa69SmGH7KRwg, P11WUJLoSHqc22Uy2MIhVA, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{9QjOj390Rou-L6x0FPjzZg}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}, {elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{mPwQsb54Q6OgpUjN44geGQ}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [10.203.116.108:9300, 10.203.116.15:9300, 10.203.116.15:9300, 10.203.116.108:9300, 10.203.116.39:9300, 10.203.116.21:9300, 10.203.116.117:9300, 10.203.116.62:9300, 10.203.116.38:9300, 10.203.116.63:9300, 10.203.116.59:9300, 10.203.116.118:9300, 10.203.116.54:9300, 10.203.116.50:9300, 10.203.116.52:9300, 10.203.116.51:9300, 10.203.116.53:9300, 10.203.116.55:9300, 10.203.116.56:9300, 10.203.116.37:9300, 10.203.116.34:9300, 10.203.116.61:9300, 10.203.116.42:9300, 10.203.116.60:9300] from hosts providers and [{elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{9QjOj390Rou-L6x0FPjzZg}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 2, last-accepted version 37 in term 2" }

Node 2:
elasticsearch02 | {"type": "server", "timestamp": "2020-02-10T08:42:08,783Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elk-docker-cluster", "node.name": "elasticsearch02", "message": "master not discovered or elected yet, an election requires 2 nodes with ids [LzgWiJ14RGWkqiyVLU06vw, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{mPwQsb54Q6OgpUjN44geGQ}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}, {elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{9QjOj390Rou-L6x0FPjzZg}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}, {elasticsearch03}{XoLLxo3IS7qdFhlrKI5yZQ}{PhKA-IOUQtKCA2U7fPR6wQ}{10.203.116.108}{10.203.116.108:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [10.203.116.114:9300, 10.203.116.108:9300, 10.203.116.114:9300, 10.203.116.108:9300, 10.203.116.39:9300, 10.203.116.21:9300, 10.203.116.117:9300, 10.203.116.62:9300, 10.203.116.38:9300, 10.203.116.63:9300, 10.203.116.59:9300, 10.203.116.118:9300, 10.203.116.54:9300, 10.203.116.50:9300, 10.203.116.52:9300, 10.203.116.51:9300, 10.203.116.53:9300, 10.203.116.55:9300, 10.203.116.56:9300, 10.203.116.37:9300, 10.203.116.34:9300, 10.203.116.61:9300, 10.203.116.42:9300, 10.203.116.60:9300] from hosts providers and [{elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{mPwQsb54Q6OgpUjN44geGQ}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

Node 3:
elasticsearch03 | {"type": "server", "timestamp": "2020-02-10T08:49:57,692Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elk-docker-cluster", "node.name": "elasticsearch03", "message": "master not discovered or elected yet, an election requires 2 nodes with ids [XoLLxo3IS7qdFhlrKI5yZQ, E0N7QP2QRRWFwggnJqz1HA], have discovered [{elasticsearch03}{XoLLxo3IS7qdFhlrKI5yZQ}{PhKA-IOUQtKCA2U7fPR6wQ}{10.203.116.108}{10.203.116.108:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}, {elasticsearch01}{E0N7QP2QRRWFwggnJqz1HA}{9QjOj390Rou-L6x0FPjzZg}{10.203.116.114}{10.203.116.114:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}, {elasticsearch02}{LzgWiJ14RGWkqiyVLU06vw}{mPwQsb54Q6OgpUjN44geGQ}{10.203.116.15}{10.203.116.15:9300}{dilm}{ml.machine_memory=33742512128, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [10.203.116.114:9300, 10.203.116.15:9300, 10.203.116.114:9300, 10.203.116.15:9300, 10.203.116.39:9300, 10.203.116.21:9300, 10.203.116.117:9300, 10.203.116.62:9300, 10.203.116.38:9300, 10.203.116.63:9300, 10.203.116.59:9300, 10.203.116.118:9300, 10.203.116.54:9300, 10.203.116.50:9300, 10.203.116.52:9300, 10.203.116.51:9300, 10.203.116.53:9300, 10.203.116.55:9300, 10.203.116.56:9300, 10.203.116.37:9300, 10.203.116.34:9300, 10.203.116.61:9300, 10.203.116.42:9300, 10.203.116.60:9300] from hosts providers and [{elasticsearch03}{XoLLxo3IS7qdFhlrKI5yZQ}{PhKA-IOUQtKCA2U7fPR6wQ}{10.203.116.108}{10.203.116.108:9300}{dilm}{ml.machine_memory=33742512128, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

Yep, duly noted.

OK, it looks like elasticsearch02 and elasticsearch03 were freshly started (with an empty data directory) but elasticsearch01 used to belong to a cluster containing at least three master-eligible nodes; the other two nodes in that cluster are not here, and were not removed explicitly, so this paragraph of the docs applies:

More precisely, if you shut down half or more of the master-eligible nodes all at the same time then the cluster will normally become unavailable. If this happens then you can bring the cluster back online by starting the removed nodes again.

Can you bring those removed nodes back online again? Or else, if this is a development cluster, maybe it's simplest to just wipe all their data paths and start again.
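If you do decide to start from scratch, with the compose file above that is roughly the following, run on each of the three VMs (warning: this permanently deletes everything in the data01 volume, so only do it on a cluster whose data you can afford to lose):

docker-compose down -v   # stop the container and remove the named volumes
docker-compose up -d     # start afresh with empty data directories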

Damn! I feel so silly. Someone else had experimented on this VM, and I wasn't aware that local nodes had been created (and later destroyed) on it. Thank you so much. It works fine now.

