ES v6.5.4: Not enough master nodes discovered during pinging - GCE discovery

I have a small ES 6.5.4 cluster (3 nodes) I'm trying to get going in a google cloud project.

I have the latest discovery-gce plugin installed (6.5.4). It appears to start, but I never see anything in the logs related to the GCE discovery service beyond the fact that it has started, followed by "timed out while waiting for initial discovery state - timeout: 30s". After that, a steady stream of "Not enough master nodes discovered during pinging" messages from ZenDiscovery continues, so the nodes are never able to talk to each other. I also tried adding the additional trace logging levels for the discovery plugin, but to no avail. I don't know if the plugin is actually working the way it's supposed to; I'm guessing it simply starts when the ES process is started.

The one time I was able to get the nodes to communicate with each other was by setting discovery.zen.ping.unicast.hosts on each node and plugging in my array of IPs manually. Unfortunately this method won't work for my situation, as I need discovery to be more dynamic.
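For reference, the static config that did get them talking looked roughly like this (a sketch; the IPs are placeholders for my node addresses):

discovery.zen.ping.unicast.hosts: ["10.24.8.57", "10.24.8.58", "10.24.8.59"]
discovery.zen.minimum_master_nodes: 2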

Here's a copy of my Elasticsearch config from one of my nodes (I'm working off a config carried over from 5.6, so I've commented out items that are no longer needed):

cluster.name: es-65x-development
node.name: es-65x-development-node-ntkc
node.master: true
node.data: true
node.ingest: true
# search.remote.connect: false -- deprecated
node.max_local_storage_nodes: 1

discovery.zen.minimum_master_nodes: 2


path:
  data: /var/lib/elasticsearch
  logs: /var/log/elasticsearch

bootstrap.memory_lock: true

network.host: _gce_
# network.bind_host: 10.24.8.57 defaults to network.host
# network.publish_host: 10.24.8.57
# transport.tcp.port: 9300-9400 -- defaults to 9300-9400
# transport.tcp.compress: false
# http.port: 9200-9300 -- defaults to 9200-9300
http.max_content_length: 100mb
# http.enabled: true --deprecated

http.cors.enabled: false

monitor.jvm.gc.overhead.warn: 100
monitor.jvm.gc.overhead.info: 50
monitor.jvm.gc.overhead.debug: 20

script.allowed_types: inline
script.allowed_contexts: search, update

cloud:
  gce:
    project_id: xyz-development
    zone: us-central1-b
discovery:
  zen.hosts_provider: gce
  gce:
    tags: es-65x-development

I would like to note that I am able to telnet successfully from each node to the others using the IP and port 9200.

If I curl the IP that 9200 is bound to on each of my nodes, it returns something like this:

curl http://10.24.8.58:9200

>    {
>       "name" : "es-65x-development-node-w9b9",
>       "cluster_name" : "es-65x-development",
>       "cluster_uuid" : "_na_",
>       "version" : {
>         "number" : "6.5.4",
>         "build_flavor" : "default",
>         "build_type" : "deb",
>         "build_hash" : "d2ef93d",
>         "build_date" : "2018-12-17T21:17:40.758843Z",
>         "build_snapshot" : false,
>         "lucene_version" : "7.5.0",
>         "minimum_wire_compatibility_version" : "5.6.0",
>         "minimum_index_compatibility_version" : "5.0.0"
> 

If I do a

netstat -tuple

I can see Elasticsearch listening on 9200 and 9300 (both tcp6; not sure if that matters).

I'm also able to verify that 9200 and 9300 are open from a different node via nmap, and I've confirmed there is no firewall enabled on each node with a

sudo ufw status

showing that it is inactive.
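For completeness, the port check from another node was something along these lines (10.24.8.58 being one of the node IPs):

nmap -p 9200,9300 10.24.8.58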

Again, the nodes were able to communicate when I set the discovery.zen.ping.unicast.hosts IPs manually, but I need to be able to use GCE discovery.

Let me know if there's any more information I can provide. I tried searching for this error, but I can't find anything that applies to my situation or my version of Elasticsearch in conjunction with the GCE discovery plugin.

Thanks in advance!

Sample of logs from one of my nodes:

[2019-01-19T15:12:53,345][WARN ][o.e.d.z.ZenDiscovery ] [es-65x-development-node-w9b9] not enough master nodes discovered during pinging(found [[Candidate{node={es-65x-development-node-w9b9}{w60MIXdGR0m1yFNoVfYbxA}{GXnxS1CYRwuPasmYgOEdAQ}{10.24.8.58}{10.24.8.58:9300}{ml.machine_memory=3877261312, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-01-19T15:12:56,529][WARN ][o.e.d.z.ZenDiscovery ] [es-65x-development-node-w9b9] not enough master nodes discovered during pinging(found [[Candidate{node={es-65x-development-node-w9b9}{w60MIXdGR0m1yFNoVfYbxA}{GXnxS1CYRwuPasmYgOEdAQ}{10.24.8.58}{10.24.8.58:9300}{ml.machine_memory=3877261312, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-01-19T15:12:59,691][WARN ][o.e.d.z.ZenDiscovery ] [es-65x-development-node-w9b9] not enough master nodes discovered during pinging(found [[Candidate{node={es-65x-development-node-w9b9}{w60MIXdGR0m1yFNoVfYbxA}{GXnxS1CYRwuPasmYgOEdAQ}{10.24.8.58}{10.24.8.58:9300}{ml.machine_memory=3877261312, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again

It sounds like the discovery-gce plugin is not finding the addresses of the other nodes for some reason, and it'd be useful to see the trace-level logs. Could you set logger.org.elasticsearch.discovery.gce: TRACE and try again? This will provide a lot more information about how Elasticsearch is interacting with GCE.
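For example, added to elasticsearch.yml on each node:

logger.org.elasticsearch.discovery.gce: TRACE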

Ok, with that logging level I'm able to see it pinging other machines in my project, as well as the machines I'm interested in:


[cropped screenshot of the trace-level discovery logs]

It's like it sees my machines and their tags but doesn't want to match them for some reason.

I'm currently injecting the cluster name into the discovery.gce.tags value in elasticsearch.yml when the machine is built.
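It's nothing fancy; roughly something like this in the startup script (a hypothetical sketch, not my actual provisioning code):

# substitute the cluster name into the discovery.gce tags line at build time
CLUSTER_NAME="es-65x-development"
sed -i "s/^\( *tags:\).*/\1 ${CLUSTER_NAME}/" /etc/elasticsearch/elasticsearch.yml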

Sorry for the confusion, yes, I meant to add that line to the elasticsearch.yml file. Could you post the actual logs, formatted with the </> button? Some of us can't read images of text (e.g. if using a screen reader), and the images you posted are heavily cropped, so it's impossible to tell what's going on anyway.

Sorry about that, hopefully this is a bit better. Here are the sections where it sees my nodes:

[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] gce instance es-65x-development-node-ntkc with status RUNNING found.
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] start filtering instance es-65x-development-node-ntkc with tags [es-65x-development].
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] comparing instance tags [es-56x-development] with tags filter [es-65x-development].
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] filtering out instance es-65x-development-node-ntkc based tags [es-65x-development], not part of {fingerprint=HmdxWPuU5nU=, items=[es-56x-development]}
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] gce instance es-65x-development-node-w2bb with status RUNNING found.
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] start filtering instance es-65x-development-node-w2bb with tags [es-65x-development].
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] comparing instance tags [es-56x-development] with tags filter [es-65x-development].
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] filtering out instance es-65x-development-node-w2bb based tags [es-65x-development], not part of {fingerprint=HmdxWPuU5nU=, items=[es-56x-development]}
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] gce instance es-65x-development-node-w9b9 with status RUNNING found.
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] start filtering instance es-65x-development-node-w9b9 with tags [es-65x-development].
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] comparing instance tags [es-56x-development] with tags filter [es-65x-development].
[2019-01-19T20:42:55,772][TRACE][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] filtering out instance es-65x-development-node-w9b9 based tags [es-65x-development], not part of {fingerprint=HmdxWP

And these look like the last lines it logs before the "not enough master nodes" message:
[DEBUG][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] 0 addresses added
[DEBUG][o.e.d.g.GceUnicastHostsProvider] [a5SBNyj] using transport addresses

Thanks, that tells us that it's not using these nodes because they are tagged as es-56x-development but you are asking for nodes tagged as es-65x-development. Note the order of the 5 and the 6.

So the es-56x-development is actually a totally different, separate cluster (and an older version).

Currently there are two instance groups / clusters in my project:

es-56x-development (old)
es-65x-development (new) <--- this is the one I'm trying to get working

If for some reason the name of the cluster is actually the issue and it can't match on that, then I can just change that.

I think you need to check carefully that the tags on these instances are what you expect, because these logs indicate that you have a node called es-65x-development-node-ntkc which is tagged as es-56x-development, not es-65x-development, and that seems to be an important mismatch.
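As a quick way to double-check (a sketch, assuming the gcloud CLI is set up for the project; the instance name and zone are taken from the config above):

gcloud compute instances describe es-65x-development-node-ntkc --zone us-central1-b --format='value(tags.items)'

# if the tag is wrong, it can be swapped without rebuilding the instance
gcloud compute instances remove-tags es-65x-development-node-ntkc --zone us-central1-b --tags es-56x-development
gcloud compute instances add-tags es-65x-development-node-ntkc --zone us-central1-b --tags es-65x-development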

David,

Just want to say thank you so much! This was something all too easy to miss on my end. When building the instances I had done Create From Similar off the Google project template for the older version, and I simply missed the Network Tag, which was still set to the old es-56x-development. Replaced that, fired up new instances, and I'm able to see them running in Cerebro!

Glad it was an easy fix in the end at least,

Cheers

Sometimes all you need is a fresh pair of eyes. We've all been there. Glad we could help. 🙂
