Node can't find cluster after restart - discovery failed [SOLVED]


(Doug Swanson) #1

Hi-

Hope this is the right place to ask this...

We had a node die on us earlier this morning due to memory errors. We restarted ES on that node; it rejoined the cluster and life was good again. The other two nodes in the cluster also had high JVM memory usage (similar to the node that died), so, being proactive, I restarted one of the other good nodes, and now it can't rejoin the cluster.

We can telnet and ping to all of the nodes in the cluster (by name or IP) from the box that won't start up. We've restarted networking and the box, turned iptables off, etc. No joy.

Here is the Discovery config snippet from the now failing node (48):

################################## Discovery ##################################
discovery.zen.minimum_master_nodes: 2
#discovery.zen.ping.timeout: 3s    
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["x.x.x.47,x.x.x.49"]

Any thoughts or insights are much appreciated. I'll try to add the logging to this topic.

Thanks
-Doug


(Doug Swanson) #2
[2015-09-23 03:05:53,605][INFO ][node                     ] [es3] version[1.7.0], pid[9168], build[929b973/2015-07-16T14:31:07Z]
[2015-09-23 03:05:53,605][INFO ][node                     ] [es3] initializing ...
[2015-09-23 03:05:53,667][INFO ][plugins                  ] [es3] loaded [jdbc-1.5.0.5-da4ba96, marvel], sites [marvel]
[2015-09-23 03:05:53,693][INFO ][env                      ] [es3] using [1] data paths, mounts [[/esdata (/dev/mapper/vg_esdata-lv_esdata)]], net usable_space [694.7gb], net total_space [733.4gb], types [ext4]
[2015-09-23 03:05:55,001][DEBUG][discovery.zen.elect      ] [es3] using minimum_master_nodes [2]
[2015-09-23 03:05:55,004][DEBUG][discovery.zen.ping.unicast] [es3] using initial hosts [10.250.13.47,10.250.13.49], with concurrent_connects [10]
[2015-09-23 03:05:55,102][DEBUG][discovery.zen            ] [es3] using ping.timeout [3s], join.timeout [1m], master_election.filter_client [true], master_election.filter_data [false]
[2015-09-23 03:05:55,105][DEBUG][discovery.zen.fd         ] [es3] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2015-09-23 03:05:55,107][DEBUG][discovery.zen.fd         ] [es3] [node  ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2015-09-23 03:05:55,716][INFO ][node                     ] [es3] initialized
[2015-09-23 03:05:55,716][INFO ][node                     ] [es3] starting ...
[2015-09-23 03:05:55,969][INFO ][transport                ] [es3] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.250.13.48:9300]}
[2015-09-23 03:05:55,980][INFO ][discovery                ] [es3] es_prod/TcgypcKkTES9d5PSa54CXw
[2015-09-23 03:05:55,982][TRACE][discovery.zen            ] [es3] starting to ping
[2015-09-23 03:05:55,987][TRACE][discovery.zen.ping.unicast] [es3] [1] connecting (light) to [#zen_unicast_1#][es3.mylexia.com][inet[10.250.13.47,10.250.13.49:9300]]
[2015-09-23 03:05:55,988][TRACE][discovery.zen.ping.unicast] [es3] [1] connecting to [es3][TcgypcKkTES9d5PSa54CXw][es3.mylexia.com][inet[/10.250.13.48:9300]]
[2015-09-23 03:05:56,108][WARN ][transport.netty          ] [es3] exception caught on transport layer [[id: 0x415811db]], closing connection
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:123)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
...
[2015-09-23 03:05:56,115][TRACE][discovery.zen.ping.unicast] [es3] [1] connected to [es3][TcgypcKkTES9d5PSa54CXw][es3.mylexia.com][inet[/10.250.13.48:9300]]
[2015-09-23 03:05:56,116][TRACE][discovery.zen.ping.unicast] [es3] [1] sending to [es3][TcgypcKkTES9d5PSa54CXw][es3.mylexia.com][inet[/10.250.13.48:9300]]
[2015-09-23 03:05:56,117][TRACE][discovery.zen.ping.unicast] [es3] [1] failed to connect to [#zen_unicast_1#][es3.mylexia.com][inet[10.250.13.47,10.250.13.49:9300]]
org.elasticsearch.transport.ConnectTransportException: [][inet[10.250.13.47,10.250.13.49:9300]] connect_timeout[30s]
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:790)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:754)
...
Caused by: java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:123)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
	at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:634)
	at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:216)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182)
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:787)
	... 7 more
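(Editor's note: the telling detail in the trace above is the host string inet[10.250.13.47,10.250.13.49:9300], where the entire comma-joined list is handed to the resolver as one hostname. A minimal sketch in Python reproduces the failure mode, assuming nothing beyond the standard socket module:)

```python
import socket

def resolves(host: str) -> bool:
    """Return True if `host` can be resolved to an address, False otherwise."""
    try:
        socket.getaddrinfo(host, 9300)
        return True
    except socket.gaierror:
        return False

# A plain numeric IP parses without any DNS lookup.
print(resolves("10.250.13.47"))

# The comma-joined string is treated as a single (invalid) hostname,
# which is the same situation that produces UnresolvedAddressException
# on the JVM side.
print(resolves("10.250.13.47,10.250.13.49"))
```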

(Doug Swanson) #3

Yes, an oh-so-simple solution: this node's unicast list was formatted wrong. The whole list was quoted rather than the individual hosts, i.e. "host,host" should have been "host","host".

When all the servers are started together you don't notice a thing, because a correctly configured node covers for it. But when you start just one node with the wrong config... bad things happen.
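For anyone landing here later, a sketch of the corrected setting, using the same masked hosts as in the original post:

```yaml
################################## Discovery ##################################
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
# Each host gets its own quoted entry; one quoted comma-joined string
# is parsed as a single unresolvable hostname.
discovery.zen.ping.unicast.hosts: ["x.x.x.47", "x.x.x.49"]
```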
