Cluster nodes unable to communicate with each other

Hi,
I have a cluster of 3 nodes. The nodes' storage got full (91%) and no new indices were being created. Then my servers hung and had to be rebooted. Now when I start the cluster, each node is unable to identify any of the other nodes, so all my indices' health status is RED. The storage space is still full and there are unassigned replicas. Can anyone please help me get my cluster communicating again? There has been no change in hostnames or IPs.

Thank you

cluster.routing.allocation.disk.watermark.high defaults to 90%, and if a node is above this level then it will not allocate any primaries. As a short-term fix you can increase cluster.routing.allocation.disk.watermark.high to, say, 92%, but you will need to free up some space or purchase some more storage in the very near future to resolve this properly.
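
For example, assuming the cluster is accepting API requests and is reachable at localhost:9200 (just a guess at your HTTP endpoint), the transient setting can be raised like this; otherwise the same setting can go into each node's elasticsearch.yml:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.high": "92%"
  }
}'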

Thank you David, but I have a doubt: would this allow my nodes to join and form the cluster? Currently the cluster status on each of the 3 Elasticsearch nodes shows only one node (the node itself). Is the high disk watermark preventing them from forming the cluster?

No, the disk watermark wouldn't be doing that. Do you have discovery.zen.minimum_master_nodes set to 2 on every node?

Yes, it was set to 2, but I kept getting the error "not enough master nodes discovered during pinging".

Can you share the logs from all the nodes? Without more information we can only really speculate.

These are some of the logs from my node:

[2019-04-12T10:54:46,763][DEBUG][o.e.a.ActionModule       ] [elk-node-1] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
[2019-04-12T10:54:47,798][INFO ][o.e.d.DiscoveryModule    ] [elk-node-1] using discovery type [zen] and host providers [settings]
[2019-04-12T10:54:49,784][INFO ][o.e.n.Node               ] [elk-node-1] initialized
[2019-04-12T10:54:49,785][INFO ][o.e.n.Node               ] [elk-node-1] starting ...
[2019-04-12T10:54:50,186][INFO ][o.e.t.TransportService   ] [elk-node-1] publish_address {xx.xx.xx.xx:9300}, bound_addresses {[::]:9300}
[2019-04-12T10:54:50,401][INFO ][o.e.b.BootstrapChecks    ] [elk-node-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-04-12T10:54:53,509][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:54:56,513][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:54:59,516][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:02,520][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:05,525][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:35,554][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:35,948][WARN ][r.suppressed             ] [elk-node-1] path: /.reporting-*/esqueue/_search, params: {index=.reporting-*, type=esqueue, version=true}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:166) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:152) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:297) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.search.TransportSearchAction.lambda$doExecute$4(TransportSearchAction.java:193) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.5.4.jar:6.5.4]

Thanks @rao-uno, it looks like the nodes just don't know about each other. Have you configured discovery.zen.ping.unicast.hosts correctly on each node? It should contain the hostnames or IP addresses of each master-eligible node in the cluster.
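
For example, each node's elasticsearch.yml should contain something along these lines (elk-node-2 and elk-node-3 are placeholder names here; substitute your actual hostnames or IP addresses):

discovery.zen.ping.unicast.hosts: ["elk-node-1", "elk-node-2", "elk-node-3"]
discovery.zen.minimum_master_nodes: 2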

Yes, I have not made any changes to my config files since the reboot, and they have the correct host IPs. I also ran netstat -plnt to check which ports are being listened on, and ports 9200 and 9300 are both listed there.

Ok, this is strange. Could you set logger.org.elasticsearch.discovery.zen.UnicastZenPing: TRACE and restart a node to get more detail on what's going wrong with discovery?

Okay, sure. I will set this configuration and get more details by tomorrow, since I don't have access to the nodes right now.


How do I set this configuration? Do I add it to elasticsearch.yml, or is there an API for it?

Yes, put that line in elasticsearch.yml. There is also an API, but it only works after the cluster has formed so it's of no use to you :frowning:
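
To spell it out, add this line to elasticsearch.yml and restart the node:

logger.org.elasticsearch.discovery.zen.UnicastZenPing: TRACE

For completeness, the dynamic equivalent (which only works once a cluster has formed, and assumes an HTTP endpoint on localhost:9200) would look like this:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{"transient": {"logger.org.elasticsearch.discovery.zen.UnicastZenPing": "TRACE"}}'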

[2019-04-15T13:53:31,026][TRACE][o.e.d.z.UnicastZenPing   ] [elk-node-1] [31] failed to ping {xx.xx.xx.xx:9300}{gs3ICmKjS06xWT4TrJ2rIw}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}
org.elasticsearch.transport.ConnectTransportException: [][xx.xx.xx.xx:9300] connect_exception
        at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:165) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:454) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:117) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:237) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:95) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:393) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.discovery.zen.UnicastZenPing$PingingRound.getOrConnect(UnicastZenPing.java:364) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.discovery.zen.UnicastZenPing$3.doRun(UnicastZenPing.java:471) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.4.jar:6.5.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /xx.xx.xx.xx:9300
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) ~[?:?]
        ... 1 more
Caused by: java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) ~[?:?]

This is the error I'm getting with the TRACE option. I have sanitized the logs (replaced my IP addresses with 'x').
I pinged the server (without the port) and was able to ping it.

I have one doubt: the port is listed as tcp6 in the netstat -plnt output, is this correct? And am I missing something else in the logs?

Ok, this is strange. No route to host is a pretty low-level error indicating something is very wrong at the network level, but I wouldn't normally expect a ping to work in the presence of this error either. Can you double (triple) check that the IP address is right? ping does something quite different from what Elasticsearch is trying to do; curl http://xx.xx.xx.xx:9300/ is closer, and should yield This is not an HTTP port if successful.

The next thing I think I'd try is using tcpdump to get a packet capture during a pinging round in order to see exactly what's going on with the network:

sudo tcpdump -i <INTERFACE> -s 65535 -w capture-$(date +%s).cap
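
If the capture is too noisy, you could also restrict it to the transport port (assuming you are using the default 9300):

sudo tcpdump -i <INTERFACE> -s 65535 -w capture-$(date +%s).cap port 9300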

You can then open the capture file in Wireshark, or you can send it to me at david.turner@elastic.co and I'll take a look.

Regarding the tcp6 listing: I think that's ok, see this StackOverflow answer for example.

Hi David,
I'm getting Failed to connect to xx.xx.xx.xx:9300; No route to host when I'm running curl http://xx.xx.xx.xx:9300/

The tcpdump shows Destination unreachable (Host administratively prohibited) for the cluster servers.

Ok, that sounds like a firewall rule preventing these nodes from communicating. For instance, there may be an iptables rule set up to reject this traffic with -j REJECT --reject-with icmp-host-prohibited.
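
To confirm, and as a rough sketch of a fix (this opens the ports to all sources; you may prefer to restrict the source to the other nodes' addresses, and if firewalld manages these rules you should change them there instead), something like:

sudo iptables -L INPUT -n --line-numbers
sudo iptables -I INPUT -p tcp --dport 9300 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 9200 -j ACCEPT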

Yes, I do see this rule in the iptables -L output:
REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
