No route to host error with new ES cluster

(Jason) #1

I'm trying to spin up a new 5-node cluster, and I'm not getting anywhere.

All 5 nodes are running a version of this same config file, with the variations being the IP addresses of the other nodes and the hostname, as well as 2 nodes being data-only nodes.

cluster.name: testcluster
node.name: node1
path.data: /opt/elasticsearch/data
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.bind_host: [_local_, _em1_]
network.publish_host: [_local_, _em1_]
discovery.zen.ping.unicast.hosts: ["", "", "", ""]
discovery.zen.minimum_master_nodes: 3
gateway.recover_after_nodes: 3
node.master: true
node.data: true

The log errors I am seeing are all very similar to this (logging set to debug):

[2016-10-21 10:14:31,563][DEBUG][transport.netty ] [node1] using profile[default], worker_count[16], port[9300-9400], bind_host[null], publish_host[null], compress[false], connect_timeout[30s], connections_per_node[2/3/6/1/1], receive_predictor[512kb->512kb]
[2016-10-21 10:14:31,615][DEBUG][transport.netty ] [node1] binding server bootstrap to: ::1
[2016-10-21 10:14:31,638][DEBUG][transport.netty ] [node1] Bound profile [default] to address {[::1]:9300}
[2016-10-21 10:14:31,640][DEBUG][transport.netty ] [node1] Bound profile [default] to address {}
[2016-10-21 10:14:31,641][DEBUG][transport.netty ] [node1] Bound profile [default] to address {}
[2016-10-21 10:14:31,642][DEBUG][transport.netty ] [node1] Bound profile [default] to address {[fe80::250:56ff:fe01:438]:9300}
[2016-10-21 10:14:31,644][INFO ][transport ] [node1] publish_address {}, bound_addresses {[::1]:9300}, {}, {}, {[ff88::001:88ff:ff88:001]:9300}
[2016-10-21 10:14:31,649][INFO ][discovery ] [node1] testcluster/1DdD32DdDDDD-3dD4Dd52d
[2016-10-21 10:14:31,652][DEBUG][cluster.service ] [node1] processing [initial_join]: execute
[2016-10-21 10:14:31,656][DEBUG][cluster.service ] [node1] processing [initial_join]: took 3ms no change in cluster_state
[2016-10-21 10:14:31,715][WARN ][transport.netty ] [node1] exception caught on transport layer [[id: 0xc564d8c7]], closing connection No route to host
at Method)

I don't quite understand why I am seeing this. Can anyone give me a clue? Occasionally I will see the other nodes try to join. So I'll see something like this:

[2016-10-21 10:14:37,680][DEBUG][discovery.zen ] [testcluster] filtered ping responses: (filter_client[true], filter_data[false])
--> ping_response{node [{node3}{WfGuHey-RG65D9UH5ZMZNw}{}{}], id[489], master [null], hasJoinedOnce [false], cluster_name[graylog]}
[2016-10-21 10:14:37,681][WARN ][transport.netty ] [testcluster] exception caught on transport layer [[id: 0x3cab43aa]], closing connection No route to host

I have SELinux set to Permissive. Any clue?
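As a quick sanity check, here is one way to rule SELinux in or out (a sketch; log paths and tool availability vary by distro):

```shell
# Show the current SELinux mode; "Permissive" logs denials but does not block.
getenforce

# Look for recent AVC denials mentioning elasticsearch (requires auditd running).
# If nothing shows up here, SELinux is unlikely to be the culprit.
grep -i avc /var/log/audit/audit.log | grep -i elastic
```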

(Mark Walkom) #2

Can you ping and telnet the other nodes?
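For reference, that check can be scripted from each node (a sketch; the node2-node5 hostnames are placeholders for your own hosts, and 9300 is the default transport port). Bash's /dev/tcp pseudo-device stands in for telnet, which isn't always installed:

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  local host=$1 port=$2
  timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null
}

# Hypothetical node list -- substitute your own hostnames or IPs.
for host in node2 node3 node4 node5; do
  if check_port "$host" 9300; then
    echo "$host:9300 reachable"
  else
    echo "$host:9300 UNREACHABLE"
  fi
done
```

A host that answers ping but shows UNREACHABLE on 9300 usually points at a firewall rather than routing.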

(Jason) #3

Yes. I could ping and telnet.

I simplified my config file, and took firewalld down, and that seemed to allow traffic to flow. Now I have to go backwards and see how I can turn firewalld back on and open up the right paths and ports.
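For anyone following along, the isolation step looks like this on RHEL 7 (a sketch; only do this temporarily, on a trusted network):

```shell
# Temporarily stop the firewall (does not persist across a reboot).
systemctl stop firewalld

# ...retest cluster formation here...

# Bring it back once you've confirmed the firewall was the problem.
systemctl start firewalld
systemctl status firewalld
```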

(Jason) #4

So I will post what has allowed this to work for me.

First, I trimmed down the elasticsearch.yml file to the following:

cluster.name: testcluster
node.name: node1
path.data: /opt/elasticsearch/data
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: [_local_, _em1_]
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["", "", "", ""]
discovery.zen.minimum_master_nodes: 1
gateway.recover_after_nodes: 3
gateway.expected_nodes: 4
gateway.recover_after_time: 30s

Then, I executed the following commands. It's important to note that I am using Red Hat Enterprise Linux 7 (RHEL 7), which has the firewalld service turned on by default. This assumes you don't want to switch to another zone in firewalld, and don't mind putting things in the public (default) zone.

firewall-cmd --zone=public --permanent --add-port=9200-9400/tcp
firewall-cmd --zone=public --permanent --add-port=9200-9400/udp
firewall-cmd --zone=public --permanent --add-source=
firewall-cmd --reload

That opens all of the appropriate ports, and ensures that the source networks you need access from are allowed through.
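To confirm the rules actually landed in the running configuration, the standard firewall-cmd queries look like this (a sketch):

```shell
# List everything active in the public zone: ports, sources, services.
firewall-cmd --zone=public --list-all

# Or query the specific pieces:
firewall-cmd --zone=public --list-ports
firewall-cmd --zone=public --list-sources

# --permanent rules only take effect after --reload; verify both views match.
firewall-cmd --zone=public --permanent --list-ports
```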
