ES upgrade 1.6.2 to 2.1.0 doesn't bind to desired address


(Dylan Humphreys) #1

Hi Everyone,
we're currently running Elasticsearch 1.7.3 and want to upgrade to 2.1.1.
The problem I'm having is that 2.x seems to have changed the default behaviour for which IPs it binds to. As a result, Elasticsearch binds to ONLY one address, and not to all of the addresses on a host.

This is our current config (or lack thereof):
# grep "network." /etc/elasticsearch/elasticsearch.yml
#network.bind_host: 192.168.0.1
#network.publish_host: 192.168.0.1
#network.host: 192.168.0.1

In 1.7.3 this gives us the default and desired behaviour:

# netstat -tulpn | grep java
tcp6       0      0 :::9300                 :::*                    LISTEN      13275/java
tcp6       0      0 :::9200                 :::*                    LISTEN      13275/java

We have this network setup on each node, with the nodes using .61, .62 and .63 respectively:

 ~ # ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
       valid_lft forever preferred_lft forever
    inet 192.168.0.60/32 scope global lo
       valid_lft forever preferred_lft forever
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    inet 192.168.0.61/24 brd 192.168.0.255 scope global bond0
       valid_lft forever preferred_lft forever

.60 is the service IP (we use keepalived to make sure requests from Kibana go to whichever node is still up, assuming a still-functioning cluster). Keepalived runs on a completely separate host.

Once we upgrade to 2.1.1 using this guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-upgrade.html
it ONLY binds to 192.168.0.60.

As a result, the nodes don't communicate with each other, and we get this:

curl -s http://192.168.0.60:9200/_cluster/health?pretty
{
  "error" : {
    "root_cause" : [ {
      "type" : "master_not_discovered_exception",
      "reason" : "waited for [30s]"
    } ],
    "type" : "master_not_discovered_exception",
    "reason" : "waited for [30s]"
  },
  "status" : 503
}

on all nodes.

What I have tried.

network.bind_host = [ "192.168.0.6x",  "192.168.0.60", "127.0.0.1" ]
network.publish_host = "192.168.0.6x"

Where x is the correct last octet for the node in question.

I also tried combinations of the above; however, we always get the same result (master_not_discovered).

All of the nodes are on the same subnet, and there are no firewalls in the way. I even verified that the nodes can connect to each other on the relevant ports using telnet.

The logs show this:

[2016-01-22 11:20:46,003][DEBUG][discovery.zen            ] [node3] filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2016-01-22 11:20:53,603][DEBUG][discovery.zen            ] [node3] filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2016-01-22 11:21:01,123][DEBUG][discovery.zen            ] [node3] filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2016-01-22 11:21:08,448][WARN ][discovery                ] [node3] waited for 30s and no initial state was set by the discovery
[2016-01-22 11:21:08,453][DEBUG][cluster.service          ] [node3] processing [gateway_initial_state_recovery]: execute
[2016-01-22 11:21:08,453][DEBUG][cluster.service          ] [node3] processing [gateway_initial_state_recovery]: took 0s no change in cluster_state
[2016-01-22 11:21:08,470][DEBUG][http.netty               ] [node3] Bound http to address {192.168.0.60:9200}
[2016-01-22 11:21:08,472][DEBUG][http.netty               ] [node3] Bound http to address {127.0.0.1:9200}
.....
[2016-01-22 11:22:56,566][WARN ][discovery.zen.ping.unicast] [node3] failed to send ping to [{node3}{FhHZuCAkQEOikQO48eIWhg}{192.168.156.63}{192.168.0.63:9300}{master=true}]

Which seems to imply it can't talk to itself...

Ideally, I'd like to recreate the current 1.7.3 behaviour, although listening on 127.0.0.1 is not required. Listening on the node's eth0 IP AND the service IP (on loopback) is a must, however.

Any pointers greatly appreciated.
Thanks in advance.

Dylan


(Magnus Bäck) #2

Setting network.bind_host to 0.0.0.0 should make it listen on all interfaces.
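
As a rough sketch of that suggestion in elasticsearch.yml: the setting name `network.bind_host` is the one from the commented-out config in the first post, and the publish address shown is only an assumption for the node whose bond0 address is 192.168.0.61. Since 2.0, Elasticsearch binds only to loopback by default, so the bind address has to be set explicitly. Pinning the publish address is probably also sensible here, because the node has more than one non-loopback address.

```yaml
# /etc/elasticsearch/elasticsearch.yml (sketch, not tested against this cluster)
# Listen on every IPv4 address on the host, including the service IP on lo
network.bind_host: 0.0.0.0
# Address advertised to the other nodes; assumed to be this node's bond0 address
network.publish_host: 192.168.0.61
```

After a restart, the `netstat -tulpn | grep java` check from the first post should show whether ports 9200 and 9300 are listening on all addresses again.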

