Node can't find cluster after restart - discovery failed [SOLVED]


(Doug Swanson) #1

Hi-

Hope this is the right place to ask this...

We had a node die on us earlier this morning due to memory errors. We restarted ES on that node; it rejoined the cluster and life was good again. The other two nodes in the cluster also had high JVM memory usage (similar to the node that died), so, being proactive, I restarted one of the other good nodes, and now it can't rejoin the cluster.

We can telnet and ping to all of the nodes in the cluster (by name or IP) from the box that won't start up. We've restarted networking and the box, turned iptables off, etc. No joy.

Here is the Discovery config snippet from the now failing node (48):

################################## Discovery ##################################
discovery.zen.minimum_master_nodes: 2
#discovery.zen.ping.timeout: 3s    
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["x.x.x.47,x.x.x.49"]

Any thoughts or insights are much appreciated. I'll try to add the logging to this topic.

Thanks
-Doug


(Doug Swanson) #2
[2015-09-23 03:05:53,605][INFO ][node                     ] [es3] version[1.7.0], pid[9168], build[929b973/2015-07-16T14:31:07Z]
[2015-09-23 03:05:53,605][INFO ][node                     ] [es3] initializing ...
[2015-09-23 03:05:53,667][INFO ][plugins                  ] [es3] loaded [jdbc-1.5.0.5-da4ba96, marvel], sites [marvel]
[2015-09-23 03:05:53,693][INFO ][env                      ] [es3] using [1] data paths, mounts [[/esdata (/dev/mapper/vg_esdata-lv_esdata)]], net usable_space [694.7gb], net total_space [733.4gb], types [ext4]
[2015-09-23 03:05:55,001][DEBUG][discovery.zen.elect      ] [es3] using minimum_master_nodes [2]
[2015-09-23 03:05:55,004][DEBUG][discovery.zen.ping.unicast] [es3] using initial hosts [10.250.13.47,10.250.13.49], with concurrent_connects [10]
[2015-09-23 03:05:55,102][DEBUG][discovery.zen            ] [es3] using ping.timeout [3s], join.timeout [1m], master_election.filter_client [true], master_election.filter_data [false]
[2015-09-23 03:05:55,105][DEBUG][discovery.zen.fd         ] [es3] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2015-09-23 03:05:55,107][DEBUG][discovery.zen.fd         ] [es3] [node  ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2015-09-23 03:05:55,716][INFO ][node                     ] [es3] initialized
[2015-09-23 03:05:55,716][INFO ][node                     ] [es3] starting ...
[2015-09-23 03:05:55,969][INFO ][transport                ] [es3] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.250.13.48:9300]}
[2015-09-23 03:05:55,980][INFO ][discovery                ] [es3] es_prod/TcgypcKkTES9d5PSa54CXw
[2015-09-23 03:05:55,982][TRACE][discovery.zen            ] [es3] starting to ping
[2015-09-23 03:05:55,987][TRACE][discovery.zen.ping.unicast] [es3] [1] connecting (light) to [#zen_unicast_1#][es3.mylexia.com][inet[10.250.13.47,10.250.13.49:9300]]
[2015-09-23 03:05:55,988][TRACE][discovery.zen.ping.unicast] [es3] [1] connecting to [es3][TcgypcKkTES9d5PSa54CXw][es3.mylexia.com][inet[/10.250.13.48:9300]]
[2015-09-23 03:05:56,108][WARN ][transport.netty          ] [es3] exception caught on transport layer [[id: 0x415811db]], closing connection
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:123)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
...
[2015-09-23 03:05:56,115][TRACE][discovery.zen.ping.unicast] [es3] [1] connected to [es3][TcgypcKkTES9d5PSa54CXw][es3.mylexia.com][inet[/10.250.13.48:9300]]
[2015-09-23 03:05:56,116][TRACE][discovery.zen.ping.unicast] [es3] [1] sending to [es3][TcgypcKkTES9d5PSa54CXw][es3.mylexia.com][inet[/10.250.13.48:9300]]
[2015-09-23 03:05:56,117][TRACE][discovery.zen.ping.unicast] [es3] [1] failed to connect to [#zen_unicast_1#][es3.mylexia.com][inet[10.250.13.47,10.250.13.49:9300]]
org.elasticsearch.transport.ConnectTransportException: [][inet[10.250.13.47,10.250.13.49:9300]] connect_timeout[30s]
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:790)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:754)
...
Caused by: java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:123)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
	at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:634)
	at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:216)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182)
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:787)
	... 7 more
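(Editor's note: the telling detail in the trace above is the host string inet[10.250.13.47,10.250.13.49:9300], where the entire comma-joined list is handed to the resolver as one hostname. A minimal sketch in Python reproduces the failure mode, assuming nothing beyond the standard socket module:)

```python
import socket

def resolves(host: str) -> bool:
    """Return True if `host` can be resolved to an address, False otherwise."""
    try:
        socket.getaddrinfo(host, 9300)
        return True
    except socket.gaierror:
        return False

# A plain numeric IP parses without any DNS lookup.
print(resolves("10.250.13.47"))

# The comma-joined string is treated as a single (invalid) hostname,
# which is the same situation that produces UnresolvedAddressException
# on the JVM side.
print(resolves("10.250.13.47,10.250.13.49"))
```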

(Doug Swanson) #3

Yes, an oh-so-simple solution: this node's unicast list was formatted wrong. The whole list was quoted rather than the individual hosts, i.e. "host,host" should have been "host","host".

When all the servers are started together you don't notice a thing, because a correctly configured node covers for it. But when you start just one node with the wrong config... bad things happen.
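For anyone landing here later, a sketch of the corrected setting, using the same masked hosts as in the original post:

```yaml
################################## Discovery ##################################
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
# Each host gets its own quoted entry; one quoted comma-joined string
# is parsed as a single unresolvable hostname.
discovery.zen.ping.unicast.hosts: ["x.x.x.47", "x.x.x.49"]
```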
