Cluster nodes unable to communicate with each other

Hi,
I have a cluster of 3 nodes. The nodes' storage got full (91%) and no new indices were being created. Then my servers hung and had to be rebooted. Now when I start the cluster, each node is unable to identify any of the other nodes, so all my indices' health status is RED. The storage space is still full and there are unassigned replicas. Can anyone please help me get my cluster communicating again? There has been no change in hostnames or IPs.

Thank you

cluster.routing.allocation.disk.watermark.high defaults to 90%, and if a node is above this level then it will not allocate any primaries. As a short-term fix you can increase cluster.routing.allocation.disk.watermark.high to, say, 92%, but you will need to free up some space or purchase some more storage in the very near future to resolve this properly.
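
For example, assuming the cluster is accepting API requests and is reachable at localhost:9200 (just a guess at your HTTP endpoint), the transient setting can be raised like this; otherwise the same setting can go into each node's elasticsearch.yml:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.high": "92%"
  }
}'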

Thank you David, but I have a doubt: would this allow my nodes to join and form the cluster? Currently the cluster status on each of the 3 Elasticsearch nodes shows only one node (the node itself). Is the high disk watermark preventing them from forming the cluster?

No, the disk watermark wouldn't be doing that. Do you have discovery.zen.minimum_master_nodes set to 2 on every node?

Yes, it was set to 2, but I kept getting the error "not enough master nodes discovered during pinging".

Can you share the logs from all the nodes? Without more information we can only really speculate.

These are some of the logs from my node:

[2019-04-12T10:54:46,763][DEBUG][o.e.a.ActionModule       ] [elk-node-1] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
[2019-04-12T10:54:47,798][INFO ][o.e.d.DiscoveryModule    ] [elk-node-1] using discovery type [zen] and host providers [settings]
[2019-04-12T10:54:49,784][INFO ][o.e.n.Node               ] [elk-node-1] initialized
[2019-04-12T10:54:49,785][INFO ][o.e.n.Node               ] [elk-node-1] starting ...
[2019-04-12T10:54:50,186][INFO ][o.e.t.TransportService   ] [elk-node-1] publish_address {xx.xx.xx.xx:9300}, bound_addresses {[::]:9300}
[2019-04-12T10:54:50,401][INFO ][o.e.b.BootstrapChecks    ] [elk-node-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-04-12T10:54:53,509][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:54:56,513][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:54:59,516][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:02,520][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:05,525][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:35,554][WARN ][o.e.d.z.ZenDiscovery     ] [elk-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={elk-node-1}{9uVegm9tSyG2HiObIzzZzw}{UENmmjH2TrGS-EfiVK3Lhg}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}{ml.machine_memory=8186187776, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2019-04-12T10:55:35,948][WARN ][r.suppressed             ] [elk-node-1] path: /.reporting-*/esqueue/_search, params: {index=.reporting-*, type=esqueue, version=true}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:166) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:152) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:297) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.search.TransportSearchAction.lambda$doExecute$4(TransportSearchAction.java:193) ~[elasticsearch-6.5.4.jar:6.5.4]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.5.4.jar:6.5.4]

Thanks @rao-uno, it looks like the nodes just don't know about each other. Have you configured discovery.zen.ping.unicast.hosts correctly on each node? It should contain the hostnames or IP addresses of each master-eligible node in the cluster.
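
For example, each node's elasticsearch.yml should contain something along these lines (elk-node-2 and elk-node-3 are placeholder names here; substitute your actual hostnames or IP addresses):

discovery.zen.ping.unicast.hosts: ["elk-node-1", "elk-node-2", "elk-node-3"]
discovery.zen.minimum_master_nodes: 2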

Yes, I have not made any changes to my config files since the reboot, and they have the correct host IPs. I also ran netstat -plnt to check which ports are being listened on, and ports 9200 and 9300 are both listed there.

Ok, this is strange. Could you set logger.org.elasticsearch.discovery.zen.UnicastZenPing: TRACE and restart a node to get more detail on what's going wrong with discovery?

Okay, sure. I will set this configuration and get more details by tomorrow, since I don't have access to the nodes right now.


How do I set this configuration? Do I add it to elasticsearch.yml, or is there an API for it?

Yes, put that line in elasticsearch.yml. There is also an API, but it only works after the cluster has formed so it's of no use to you :frowning:
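
To spell it out, add this line to elasticsearch.yml and restart the node:

logger.org.elasticsearch.discovery.zen.UnicastZenPing: TRACE

For completeness, the dynamic equivalent (which only works once a cluster has formed, and assumes an HTTP endpoint on localhost:9200) would look like this:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{"transient": {"logger.org.elasticsearch.discovery.zen.UnicastZenPing": "TRACE"}}'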

[2019-04-15T13:53:31,026][TRACE][o.e.d.z.UnicastZenPing   ] [elk-node-1] [31] failed to ping {xx.xx.xx.xx:9300}{gs3ICmKjS06xWT4TrJ2rIw}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}
org.elasticsearch.transport.ConnectTransportException: [][xx.xx.xx.xx:9300] connect_exception
        at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:165) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:454) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:117) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:237) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:95) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:393) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.discovery.zen.UnicastZenPing$PingingRound.getOrConnect(UnicastZenPing.java:364) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.discovery.zen.UnicastZenPing$3.doRun(UnicastZenPing.java:471) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.4.jar:6.5.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /xx.xx.xx.xx:9300
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) ~[?:?]
        ... 1 more
Caused by: java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) ~[?:?]

This is the error I'm getting with the TRACE option. I have sanitized the logs (replaced my IP addresses with 'x').
I pinged the server (without the port) and was able to ping it.

I have one doubt: the port is listed as tcp6 in the netstat -plnt output, is this correct? And am I missing something else in the logs?

Ok, this is strange. No route to host is a pretty low-level error indicating something is very wrong at the network level, but I wouldn't normally expect a ping to work in the presence of this error either. Can you double (triple) check that the IP address is right? ping does something quite different from what Elasticsearch is trying to do; curl http://xx.xx.xx.xx:9300/ is closer, and should yield This is not an HTTP port if successful.

The next thing I think I'd try is using tcpdump to get a packet capture during a pinging round in order to see exactly what's going on with the network:

sudo tcpdump -i <INTERFACE> -s 65535 -w capture-$(date +%s).cap
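
If the capture is too noisy, you could also restrict it to the transport port (assuming you are using the default 9300):

sudo tcpdump -i <INTERFACE> -s 65535 -w capture-$(date +%s).cap port 9300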

You can then open the capture file in Wireshark, or you can send it to me at david.turner@elastic.co and I'll take a look.

Regarding the tcp6 listing: I think that's ok, see this StackOverflow answer for example.

Hi David,
I'm getting Failed to connect to xx.xx.xx.xx:9300; No route to host when I'm running curl http://xx.xx.xx.xx:9300/

The tcpdump shows Destination unreachable (Host administratively prohibited) for the cluster servers.

Ok, that sounds like a firewall rule preventing these nodes from communicating. For instance, there may be an iptables rule set up to reject this traffic with -j REJECT --reject-with icmp-host-prohibited.
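
To confirm, and as a rough sketch of a fix (this opens the ports to all sources; you may prefer to restrict the source to the other nodes' addresses, and if firewalld manages these rules you should change them there instead), something like:

sudo iptables -L INPUT -n --line-numbers
sudo iptables -I INPUT -p tcp --dport 9300 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 9200 -j ACCEPT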

Yes, I do see this rule in the iptables -L output:
REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
