Elasticsearch 6.0.0 gce discovery failing

zozo6015 · November 30, 2017, 9:23am

Hello,

I have set up an Elasticsearch cluster on GCE with the GCE discovery plugin. Yesterday everything was OK and the nodes joined the cluster correctly, but when I checked today I noticed that the nodes are disconnected because the GCE plugin is failing.

Here is what I can see in the logs:

https://pastebin.com/S7GxdSjp

Any idea how to fix it?

Regards,
Peter

dadoonet · November 30, 2017, 12:15pm

    [2017-11-30T09:10:25,684][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] disabling GCE discovery. Can not get list of nodes
    [2017-11-30T09:10:28,685][WARN ][o.e.d.z.ZenDiscovery     ] [es-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={es-master-1}{Kmb4RtHaTweY-15sIVM1XA}{x6NmW8AvTbeukZY4MBhatw}{10.0.2.10}{10.0.2.10:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
    [2017-11-30T09:15:26,524][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] Problem fetching instance list for zone europe-west3-a
    java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:400) ~[?:?]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:243) ~[?:?]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:225) ~[?:?]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:402) ~[?:?]
        at java.net.Socket.connect(Socket.java:591) ~[?:?]
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:657) ~[?:?]
        at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?]
        at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265) ~[?:?]
        at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:372) ~[?:?]
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1181) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1075) ~[?:?]
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177) ~[?:?]
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:163) ~[?:?]
        at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93) ~[google-http-client-1.20.0.jar:1.20.0]
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) ~[google-http-client-1.20.0.jar:1.20.0]
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) ~[google-api-client-1.20.0.jar:1.20.0]
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) ~[google-api-client-1.20.0.jar:1.20.0]
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) ~[google-api-client-1.20.0.jar:1.20.0]
        at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.lambda$null$0(GceInstancesServiceImpl.java:71) ~[discovery-gce-6.0.0.jar:6.0.0]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        at org.elasticsearch.cloud.gce.util.Access.doPrivilegedIOException(Access.java:59) ~[discovery-gce-6.0.0.jar:6.0.0]
        at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.lambda$instances$2(GceInstancesServiceImpl.java:69) [discovery-gce-6.0.0.jar:6.0.0]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) [?:?]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1494) [?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) [?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) [?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) [?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) [?:?]
        at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:486) [?:?]
        at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.instances(GceInstancesServiceImpl.java:82) [discovery-gce-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.gce.GceUnicastHostsProvider.buildDynamicNodes(GceUnicastHostsProvider.java:132) [discovery-gce-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.UnicastZenPing.ping(UnicastZenPing.java:309) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.UnicastZenPing.ping(UnicastZenPing.java:286) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.pingAndWait(ZenDiscovery.java:1077) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:927) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:449) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$2500(ZenDiscovery.java:90) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1286) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
    [2017-11-30T09:15:26,527][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] disabling GCE discovery. Can not get list of nodes
    [2017-11-30T09:15:29,528][WARN ][o.e.d.z.ZenDiscovery     ] [es-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={es-master-1}{Kmb4RtHaTweY-15sIVM1XA}{x6NmW8AvTbeukZY4MBhatw}{10.0.2.10}{10.0.2.10:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again

Hmmm. It sounds like the GCE API is not responding and the request times out.
This happens when we call:

client().instances().list(project, zoneId);

Did you restart your nodes in the meantime? Any event that we should be aware of?
What is the status of your nodes? Do you have to manually restart the failing node?
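
The `connect timed out` in the trace means the TCP connection to the API host was never established, which can be checked from the affected node independently of Elasticsearch. Below is a minimal sketch, assuming `bash` and coreutils `timeout` are available. `probe_tcp` is a made-up helper name, and `www.googleapis.com:443` is an assumption about the host the plugin's Google client contacts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to open a TCP connection to host:port within a time limit.
# Prints "reachable", "timed out" or "failed".
probe_tcp() {
  local host="$1" port="$2" secs="${3:-5}" rc
  # bash's /dev/tcp pseudo-device opens a TCP socket; timeout(1) bounds the attempt.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null \
    && rc=0 || rc=$?
  case "$rc" in
    0)   echo "reachable" ;;
    124) echo "timed out" ;;   # timeout(1) exits with 124 when the limit is hit
    *)   echo "failed" ;;      # refused, DNS failure, missing tools, ...
  esac
}

# Usage (assumed host, see above):
#   probe_tcp www.googleapis.com 443 5
```

A "timed out" result would point at dropped packets (routing, firewall, missing outbound internet access) rather than at Elasticsearch itself.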

zozo6015 · November 30, 2017, 1:51pm

Hello,

No, I didn't restart the node. I only saw that the node was unresponsive and restarted the elasticsearch service, which ended up in the situation described above. I will restart the node and see what happens.

Peter

zozo6015 · November 30, 2017, 4:27pm

I have restarted the node and I got the same thing.

elasticsearch.yml looks like this:

cluster.name: clustername
node.name: es-master-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.minimum_master_nodes: 2
cloud:
  gce:
    project_id: projectid
    zone: europe-west3-a
    retry: true
    max_wait: 300s
    refresh_interval: 0s
discovery:
  zen.hosts_provider: gce
node.master: true
node.data: false
node.ingest: false
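
For context on the `needed [2]` in the ZenDiscovery warning: it comes from `discovery.zen.minimum_master_nodes: 2`, which is normally set to a quorum of the master-eligible nodes, `(master_eligible_nodes / 2) + 1`. A small sketch of that arithmetic follows. The thread does not say how many master-eligible nodes exist, so the counts below are assumptions, and `quorum` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: quorum of master-eligible nodes, using integer division.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Assumed example: with 3 master-eligible nodes the quorum is 2, so a node that
# can only discover itself (found 1, needed 2) keeps pinging and cannot elect a master.
quorum 3
```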

dadoonet · November 30, 2017, 5:07pm

Did you change any security settings on that platform? Or network settings?

Are you able to curl the metadata endpoint of the GCP API? (Can’t check the URL ATM.)

zozo6015 · November 30, 2017, 5:11pm

Nothing has changed in the network or security settings. Any idea what the metadata endpoint of the GCP API is?

dadoonet · November 30, 2017, 5:26pm

http://metadata.google.internal/computeMetadata/v1/

According to https://cloud.google.com/compute/docs/storing-retrieving-metadata#querying

zozo6015 · November 30, 2017, 5:30pm

Actually, it can:

curl -s http://metadata/computeMetadata/v1/ -H "Metadata-Flavor: Google"
instance/
oslogin/
project/
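
Note that the metadata server and the Compute API are different endpoints, so a successful metadata query does not show that the instance-list call can get through. Below is a rough sketch of checking the latter by hand from the node, assuming `curl` and `sed` are installed. The function names are made up for illustration, the `www.googleapis.com` host is an assumption about what the plugin's Google client uses, and the instance's service account is assumed to have a scope that allows reading Compute resources.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the Compute API URL that lists instances in a zone.
instances_url() {
  echo "https://www.googleapis.com/compute/v1/projects/$1/zones/$2/instances"
}

# Hypothetical helper: fetch an access token for the default service account
# from the metadata server and extract the "access_token" field.
fetch_token() {
  curl -s --connect-timeout 5 -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" \
    | sed -n 's/.*"access_token" *: *"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: call the instance-list endpoint with a bounded connect time.
list_instances() {
  local token
  token="$(fetch_token)"
  [ -n "$token" ] || { echo "could not obtain a token" >&2; return 1; }
  curl -s --connect-timeout 10 -H "Authorization: Bearer ${token}" \
    "$(instances_url "$1" "$2")"
}

# Usage, with the placeholder values from the elasticsearch.yml above:
#   list_instances projectid europe-west3-a
```

If this call hangs or times out while the metadata query works, the problem is likely on the path to the public API rather than in the plugin.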

dadoonet · December 18, 2017, 3:40pm

This is really strange. Did you figure this out in the meantime? Might it be worth starting a new physical machine?

zozo6015 · December 18, 2017, 3:54pm

No, unfortunately it was going on and off randomly, and we decided it is better to use unicast network discovery instead of GCE discovery. It was just not reliable for production use.
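
For readers who take the same route: in 6.x, static unicast discovery is configured through `discovery.zen.ping.unicast.hosts`, with the `zen.hosts_provider: gce` line and the `cloud.gce` block removed. Below is a minimal sketch that renders that setting from a list of addresses. `unicast_hosts_line` is a made-up helper, and apart from `10.0.2.10` (taken from the log above) the addresses are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render the static host list as an elasticsearch.yml line.
unicast_hosts_line() {
  local out="" h
  for h in "$@"; do
    # Append each host in double quotes, comma-separated after the first one.
    out="${out:+$out, }\"$h\""
  done
  echo "discovery.zen.ping.unicast.hosts: [$out]"
}

# 10.0.2.11 and 10.0.2.12 are assumed addresses of the other master-eligible nodes.
unicast_hosts_line 10.0.2.10 10.0.2.11 10.0.2.12
```

The trade-off is that the host list has to be maintained by hand whenever master-eligible nodes are added or replaced.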

dadoonet · December 18, 2017, 4:32pm

Thanks for the feedback. That's sad to hear.
I wonder what we can do to fix it. Actually, maybe the recent update of the Google SDK fixes that behavior...

dadoonet · December 18, 2017, 4:33pm

It has been released as part of 6.0.1. Do you think you could give it a try?

zozo6015 · December 18, 2017, 4:40pm

Unfortunately we cannot upgrade the elasticsearch version yet, since it is used in production.

system · January 15, 2018, 4:40pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
