I have set up an Elasticsearch cluster on GCE with the GCE discovery plugin. Yesterday everything was OK and the nodes joined the cluster correctly, but when I checked today I noticed that the nodes had disconnected because the GCE plugin was failing.
[2017-11-30T09:10:25,684][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] disabling GCE discovery. Can not get list of nodes
[2017-11-30T09:10:28,685][WARN ][o.e.d.z.ZenDiscovery ] [es-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={es-master-1}{Kmb4RtHaTweY-15sIVM1XA}{x6NmW8AvTbeukZY4MBhatw}{10.0.2.10}{10.0.2.10:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2017-11-30T09:15:26,524][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] Problem fetching instance list for zone europe-west3-a
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:400) ~[?:?]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:243) ~[?:?]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:225) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:402) ~[?:?]
at java.net.Socket.connect(Socket.java:591) ~[?:?]
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:657) ~[?:?]
at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?]
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265) ~[?:?]
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:372) ~[?:?]
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191) ~[?:?]
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1181) ~[?:?]
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1075) ~[?:?]
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177) ~[?:?]
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:163) ~[?:?]
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93) ~[google-http-client-1.20.0.jar:1.20.0]
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) ~[google-http-client-1.20.0.jar:1.20.0]
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) ~[google-api-client-1.20.0.jar:1.20.0]
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) ~[google-api-client-1.20.0.jar:1.20.0]
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) ~[google-api-client-1.20.0.jar:1.20.0]
at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.lambda$null$0(GceInstancesServiceImpl.java:71) ~[discovery-gce-6.0.0.jar:6.0.0]
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at org.elasticsearch.cloud.gce.util.Access.doPrivilegedIOException(Access.java:59) ~[discovery-gce-6.0.0.jar:6.0.0]
at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.lambda$instances$2(GceInstancesServiceImpl.java:69) [discovery-gce-6.0.0.jar:6.0.0]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) [?:?]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1494) [?:?]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) [?:?]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) [?:?]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) [?:?]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) [?:?]
at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:486) [?:?]
at org.elasticsearch.cloud.gce.GceInstancesServiceImpl.instances(GceInstancesServiceImpl.java:82) [discovery-gce-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.gce.GceUnicastHostsProvider.buildDynamicNodes(GceUnicastHostsProvider.java:132) [discovery-gce-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.zen.UnicastZenPing.ping(UnicastZenPing.java:309) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.zen.UnicastZenPing.ping(UnicastZenPing.java:286) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.pingAndWait(ZenDiscovery.java:1077) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:927) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:449) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.access$2500(ZenDiscovery.java:90) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1286) [elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2017-11-30T09:15:26,527][WARN ][o.e.c.g.GceInstancesServiceImpl] [es-master-1] disabling GCE discovery. Can not get list of nodes
[2017-11-30T09:15:29,528][WARN ][o.e.d.z.ZenDiscovery ] [es-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={es-master-1}{Kmb4RtHaTweY-15sIVM1XA}{x6NmW8AvTbeukZY4MBhatw}{10.0.2.10}{10.0.2.10:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
Hmmm. It sounds like the GCE API is not responding and the request times out.
This happens when we call:
client().instances().list(project, zoneId);
Did you restart your nodes in the meantime? Any event that we should be aware of?
What is the status of your nodes? Do you have to manually restart the failing node?
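To check whether the node can reach the API endpoint at all, you could try a plain TCP connect with a timeout, since that is the step failing in the stack trace (`Socket.connect`). This is only a minimal sketch and not part of the plugin: the host `www.googleapis.com`, port 443 and the 5 second timeout are my assumptions, and `ConnectProbe` and `canConnect` are hypothetical names.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of the discovery-gce plugin.
final class ConnectProbe {

    // Returns true if a TCP connection to host:port opens within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException, connection refused and unresolved hosts.
            System.err.println("connect to " + host + ":" + port + " failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed endpoint and timeout; adjust for your environment.
        boolean ok = canConnect("www.googleapis.com", 443, 5000);
        System.out.println(ok ? "reachable" : "unreachable");
    }
}
```

If this also times out when run on the failing node, the problem is more likely in the network path (firewall, routing, proxy) than in the plugin itself.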
No, I didn't restart the node. I only saw that the node was unresponsive and restarted the elasticsearch service, which ended up in the situation described above. I will restart the node and see what happens.
No, unfortunately it was going on and off randomly, and we decided it is better to use unicast network discovery instead of GCE discovery. It was just not reliable enough for production use.