Elasticsearch 6.0.0 gce discovery failing

Hello,

I have set up an Elasticsearch cluster on GCE with the GCE discovery plugin. Yesterday everything was OK and the nodes joined the cluster correctly, but when I checked today I noticed that the nodes are disconnected because the GCE plugin is failing.

Here is what I can see in the logs:

https://pastebin.com/S7GxdSjp

Any idea how to fix it?

Regards,
Peter

    [2017-11-30T09:10:25,684][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] disabling GCE discovery. Can not get list of nodes
    [2017-11-30T09:10:28,685][WARN ][o.e.d.z.ZenDiscovery     ] [es-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={es-master-1}{Kmb4RtHaTweY-15sIVM1XA}{x6NmW8AvTbeukZY4MBhatw}{10.0.2.10}{10.0.2.10:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
    [2017-11-30T09:15:26,524][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] Problem fetching instance list for zone europe-west3-a
    java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:400) ~[?:?]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:243) ~[?:?]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:225) ~[?:?]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:402) ~[?:?]
        at java.net.Socket.connect(Socket.java:591) ~[?:?]
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:657) ~[?:?]
        at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?]
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?]
        at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265) ~[?:?]
        at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:372) ~[?:?]
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1181) ~[?:?]
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1075) ~[?:?]
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177) ~[?:?]
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:163) ~[?:?]
        at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93) ~[google-http-client-1.20.0.jar:1.20.0]
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) ~[google-http-client-1.20.0.jar:1.20.0]
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) ~[google-api-client-1.20.0.jar:1.20.0]
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) ~[google-api-client-1.20.0.jar:1.20.0]
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) ~[google-api-client-1.20.0.jar:1.20.0]
        at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.lambda$null$0(GceInstancesServiceImpl.java:71) ~[discovery-gce-6.0.0.jar:6.0.0]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        at org.elasticsearch.cloud.gce.util.Access.doPrivilegedIOException(Access.java:59) ~[discovery-gce-6.0.0.jar:6.0.0]
        at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.lambda$instances$2(GceInstancesServiceImpl.java:69) [discovery-gce-6.0.0.jar:6.0.0]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) [?:?]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1494) [?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) [?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) [?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) [?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) [?:?]
        at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:486) [?:?]
        at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.instances(GceInstancesServiceImpl.java:82) [discovery-gce-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.gce.GceUnicastHostsProvider.buildDynamicNodes(GceUnicastHostsProvider.java:132) [discovery-gce-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.UnicastZenPing.ping(UnicastZenPing.java:309) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.UnicastZenPing.ping(UnicastZenPing.java:286) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.pingAndWait(ZenDiscovery.java:1077) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:927) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:449) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$2500(ZenDiscovery.java:90) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1286) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
    [2017-11-30T09:15:26,527][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] disabling GCE discovery. Can not get list of nodes
    [2017-11-30T09:15:29,528][WARN ][o.e.d.z.ZenDiscovery     ] [es-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={es-master-1}{Kmb4RtHaTweY-15sIVM1XA}{x6NmW8AvTbeukZY4MBhatw}{10.0.2.10}{10.0.2.10:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again

Hmm. It sounds like the GCE API is not responding and the request times out.
This happens when we call:

client().instances().list(project, zoneId);

Did you restart your nodes in the meantime? Any event that we should be aware of?
What is the status of your nodes? Do you have to manually restart the failing node?
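
Since the stack trace fails at the socket connect stage of an HTTPS request, one way to narrow this down is to check from the affected node whether a plain TCP connection to the API host opens within a timeout. Below is a minimal Python sketch of such a probe. The helper name `can_connect` and the 5-second default timeout are assumptions for illustration and are not taken from the plugin.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connect timeouts, refused connections and DNS failures alike.
        return False
```

Calling `can_connect("www.googleapis.com", 443)` on the node (that host name is an assumption about where the Compute API is served) and getting `False` would point to a network-level problem rather than something inside Elasticsearch.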

Hello,

No, I didn't restart the node. I only saw that the node was unresponsive and restarted the elasticsearch service, which ended up in the situation described above. I will restart the node and see what happens.

Peter

I have restarted the node and I got the same thing.

elasticsearch.yml looks like this:

cluster.name: clustername
node.name: es-master-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.minimum_master_nodes: 2
cloud:
  gce:
    project_id: projectid
    zone: europe-west3-a
    retry: true
    max_wait: 300s
    refresh_interval: 0s
discovery:
  zen.hosts_provider: gce
node.master: true
node.data: false
node.ingest: false
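
For context on the `needed [2]` in the log: `discovery.zen.minimum_master_nodes` is normally set to a quorum of the master-eligible nodes, that is (master-eligible nodes / 2) + 1 with integer division. The Python sketch below works through that arithmetic. The count of three master-eligible nodes used in the usage note is an assumption, since the thread does not say how many there are, and the function names are illustrative only.

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def has_quorum(found, master_eligible):
    """True when the number of discovered candidates reaches the quorum."""
    return found >= minimum_master_nodes(master_eligible)
```

With three master-eligible nodes the quorum is 2, so a node that only finds itself (`has_quorum(1, 3)`) keeps pinging, which matches the "found ... but needed [2]" warnings once GCE discovery returns no other hosts.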

Did you change any security settings on that platform? Or network settings?

Are you able to curl the metadata endpoint of the GCP API? (I can't check the URL at the moment.)

Nothing has changed in the network or security settings. Any idea what the metadata endpoint of the GCP API is?

http://metadata.google.internal/computeMetadata/v1/

According to https://cloud.google.com/compute/docs/storing-retrieving-metadata#querying

Actually, it can be reached:

curl -s http://metadata/computeMetadata/v1/ -H "Metadata-Flavor: Google"
instance/
oslogin/
project/
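
The same check can be scripted, for example to run it repeatedly while the problem comes and goes. This is a minimal Python sketch using only the standard library. The function names and the 5-second timeout are assumptions, while the URL and the `Metadata-Flavor: Google` header are the ones used above.

```python
import urllib.request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/"


def build_metadata_request(path=""):
    """Build a GET request for the metadata server with the required header."""
    return urllib.request.Request(
        METADATA_ROOT + path,
        headers={"Metadata-Flavor": "Google"},
    )


def fetch_metadata(path="", timeout=5.0):
    """Fetch a metadata path and return the body as text (only works on a GCE instance)."""
    with urllib.request.urlopen(build_metadata_request(path), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Note that a successful response here only shows that the metadata server is reachable over HTTP. The call in the stack trace goes through `HttpsURLConnectionImpl`, so it appears to target a different, HTTPS endpoint.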

This is really strange. Did you figure this out in the meantime? Might it be worth starting a new physical machine?

No, unfortunately it was going on and off randomly, and we decided that it is better to use unicast network discovery instead of GCE discovery. It was just not reliable enough for production use.

Thanks for the feedback. That's sad to hear.
I wonder what we can do to fix it. Actually, maybe recent updates of the Google SDK fix that behavior...

It has been released as part of 6.0.1. Do you think you could give it a try?

Unfortunately we cannot upgrade the elasticsearch version yet, since it is used in production.
