No route to host error with new ES cluster

I'm trying to spin up a new 5-node cluster, and I'm not getting anywhere.

All 5 nodes are running a variation of this same config file; the only differences are the node name/hostname, the IP addresses listed for the other nodes, and the fact that 2 of the nodes are data-only nodes (a sketch of that variation follows the config below).

cluster.name: testcluster
node.name: node1
path.data: /opt/elasticsearch/data
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: [_local_, _em1_]
network.bind_host: [_local_, _em1_]
network.publish_host: [_local_, _em1_]
discovery.zen.ping.unicast.hosts: ["10.0.1.2", "10.0.1.3", "10.0.1.4", "10.0.1.5"]
discovery.zen.minimum_master_nodes: 3
gateway.recover_after_nodes: 3
node.master: true
node.data: true
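
For reference, the two data-only nodes differ from the above roughly like this (the node name is a hypothetical placeholder for illustration):

node.name: node4          # hypothetical name for one of the data-only nodes
node.master: false        # data-only: not master-eligible
node.data: true
# (the unicast hosts list likewise swaps in the other four nodes' addresses)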

The log errors I am seeing are all very similar to this (logging set to debug):

[2016-10-21 10:14:31,563][DEBUG][transport.netty ] [node1] using profile[default], worker_count[16], port[9300-9400], bind_host[null], publish_host[null], compress[false], connect_timeout[30s], connections_per_node[2/3/6/1/1], receive_predictor[512kb->512kb]
[2016-10-21 10:14:31,615][DEBUG][transport.netty ] [node1] binding server bootstrap to: ::1
[2016-10-21 10:14:31,638][DEBUG][transport.netty ] [node1] Bound profile [default] to address {[::1]:9300}
[2016-10-21 10:14:31,640][DEBUG][transport.netty ] [node1] Bound profile [default] to address {127.0.0.1:9300}
[2016-10-21 10:14:31,641][DEBUG][transport.netty ] [node1] Bound profile [default] to address {10.0.1.1:9300}
[2016-10-21 10:14:31,642][DEBUG][transport.netty ] [node1] Bound profile [default] to address {[fe80::250:56ff:fe01:438]:9300}
[2016-10-21 10:14:31,644][INFO ][transport ] [node1] publish_address {10.0.1.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}, {10.0.1.1:9300}, {[ff88::001:88ff:ff88:001]:9300}
[2016-10-21 10:14:31,649][INFO ][discovery ] [node1] testcluster/1DdD32DdDDDD-3dD4Dd52d
[2016-10-21 10:14:31,652][DEBUG][cluster.service ] [node1] processing [initial_join]: execute
[2016-10-21 10:14:31,656][DEBUG][cluster.service ] [node1] processing [initial_join]: took 3ms no change in cluster_state
[2016-10-21 10:14:31,715][WARN ][transport.netty ] [node1] exception caught on transport layer [[id: 0xc564d8c7]], closing connection
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

I don't quite understand why I'm seeing this. Can anyone give me a clue? Occasionally the other nodes will try to join, so I'll see something like this:

[2016-10-21 10:14:37,680][DEBUG][discovery.zen ] [testcluster] filtered ping responses: (filter_client[true], filter_data[false])
--> ping_response{node [{node3}{WfGuHey-RG65D9UH5ZMZNw}{10.0.1.2}{10.0.1.2:9300}], id[489], master [null], hasJoinedOnce [false], cluster_name[graylog]}
[2016-10-21 10:14:37,681][WARN ][transport.netty ] [testcluster] exception caught on transport layer [[id: 0x3cab43aa]], closing connection
java.net.NoRouteToHostException: No route to host

I have SELinux set to Permissive. Any clue?
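
For anyone checking the same thing, the SELinux mode can be confirmed with the standard RHEL tools:

getenforce   # should print "Permissive" (or "Disabled")
sestatus     # fuller status, including the loaded policy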

Can you ping and telnet the other nodes?

Yes, I can ping and telnet to the other nodes.
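
Something like the following, with 10.0.1.2 standing in for one of the other nodes:

ping -c 3 10.0.1.2     # basic reachability
telnet 10.0.1.2 9300   # Elasticsearch transport port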

I simplified my config file and took firewalld down, and that seemed to allow traffic to flow. Now I have to go backwards and figure out how to turn firewalld back on while opening up the right ports.
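
For anyone trying the same test, taking firewalld down temporarily was just (on each node, for testing only):

sudo systemctl stop firewalld    # temporarily disable the firewall
# ...watch whether the nodes join the cluster...
sudo systemctl start firewalld   # re-enable before moving on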

So I'll post what got this working for me.

First, I trimmed down the elasticsearch.yml file to the following:

cluster.name: testcluster
node.name: node1
path.data: /opt/elasticsearch/data
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: [_local_, _em1_]
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["10.0.1.2:9300", "10.0.1.3:9300", "10.0.1.4:9300", "10.0.1.5:9300"]
discovery.zen.minimum_master_nodes: 1
gateway.recover_after_nodes: 3
gateway.expected_nodes: 4
gateway.recover_after_time: 30s
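
After editing elasticsearch.yml, each node needs a service restart to pick up the changes (assuming the standard RPM/systemd service name):

sudo systemctl restart elasticsearch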

Then I executed the following commands. It's important to note that I'm using Red Hat Enterprise Linux 7 (RHEL 7), which has the firewalld service turned on by default. This assumes you don't want to switch to another firewalld zone, and don't mind putting things in the public (default) zone.

firewall-cmd --zone=public --permanent --add-port=9200-9400/tcp   # HTTP (9200) and transport (9300-9400) ports
firewall-cmd --zone=public --permanent --add-port=9200-9400/udp
firewall-cmd --zone=public --permanent --add-source=10.0.1.0/24   # allow traffic from the cluster subnet
firewall-cmd --reload   # apply the permanent rules

That permanently opens the full range of Elasticsearch ports (9200 for HTTP, 9300-9400 for transport) in the public zone, allows traffic from the cluster's subnet, and reloads firewalld so the rules take effect.
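
To double-check, you can confirm the rules on each node and then hit the cluster health endpoint (10.0.1.1 standing in for any node's address):

firewall-cmd --zone=public --list-all                # confirm the ports and source are present
curl 'http://10.0.1.1:9200/_cluster/health?pretty'   # should report all 5 nodes once they join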
