Nodes Constantly Disconnected from Cluster

I have 3 master nodes acting as hot nodes and 3 warm data nodes, all with 16 GB of RAM, holding 73 indices (3 shards, 1 replica each). However, the 3 warm nodes keep disconnecting at random and I can't figure out why. Pinging those 3 nodes shows no timeouts when the disconnections happen, and all the servers are on the same network.

This behavior shows up when I'm moving and reindexing indices to the warm nodes, which causes issues for the cluster.
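By "moving" I mean the usual box_type allocation-filtering approach, something like the following sketch (the index name is just an example; box_type is the node attribute visible in the logs below):

curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.require.box_type": "warm"
}
'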

ES version: 6.7
Logs from one of the master nodes:

[2019-04-01T12:13:07,732][INFO ][o.e.c.s.MasterService    ] [FI-ELK6-NODE-1] zen-disco-node-failed({FI-ELK6-WarmNode1}{NBB7-Oj_ThS_jFS2vGJmaA}{KIMipXv9SbSyxZccmxifQA}{192.168.28.90}{192.168.28.90:9300}{xpack.installed=true, box_type=warm}), reason(transport disconnected)[{FI-ELK6-WarmNode1}{NBB7-Oj_ThS_jFS2vGJmaA}{KIMipXv9SbSyxZccmxifQA}{192.168.28.90}{192.168.28.90:9300}{xpack.installed=true, box_type=warm} transport disconnected], reason: removed {{FI-ELK6-WarmNode1}{NBB7-Oj_ThS_jFS2vGJmaA}{KIMipXv9SbSyxZccmxifQA}{192.168.28.90}{192.168.28.90:9300}{xpack.installed=true, box_type=warm},}
[2019-04-01T12:13:08,040][INFO ][o.e.c.s.ClusterApplierService] [FI-ELK6-NODE-1] removed {{FI-ELK6-WarmNode1}{NBB7-Oj_ThS_jFS2vGJmaA}{KIMipXv9SbSyxZccmxifQA}{192.168.28.90}{192.168.28.90:9300}{xpack.installed=true, box_type=warm},}, reason: apply cluster state (from master [master {FI-ELK6-NODE-1}{0ho7GGABRHKH7Dvi73k8BA}{tHDN2KbQRT-wFqCTM3bGdw}{192.168.28.74}{192.168.28.74:9300}{xpack.installed=true, box_type=hot} committed version [1956] source [zen-disco-node-failed({FI-ELK6-WarmNode1}{NBB7-Oj_ThS_jFS2vGJmaA}{KIMipXv9SbSyxZccmxifQA}{192.168.28.90}{192.168.28.90:9300}{xpack.installed=true, box_type=warm}), reason(transport disconnected)[{FI-ELK6-WarmNode1}{NBB7-Oj_ThS_jFS2vGJmaA}{KIMipXv9SbSyxZccmxifQA}{192.168.28.90}{192.168.28.90:9300}{xpack.installed=true, box_type=warm} transport disconnected]]])

Any help is appreciated.

BR.

I ran into the same problem today. Have you solved it?

It very much looks like a connectivity issue, but it's hard to tell any more from the two short log lines that you've shared. There will be more messages, including stack traces, that will contain more information.

I'll get the logs and do some testing once I'm back at the workplace, but based on my previous observations there were no timeouts when pinging the master. Yet at the same time Elasticsearch logged "[WARN ][o.e.d.z.ZenDiscovery ] [FI-ELK6-WarmNode3] master left (reason = failed to ping, tried [3] times, each with maximum [1m] timeout), current nodes: nodes:", which baffles me.

Ok, the "pings" that Elasticsearch mentions in that message are very different from the pings that the ping command sends, so a simple "ping test" is not a reliable indicator. Mostly when I've seen this sort of thing in the past it's normally been due to a misconfigured firewall. Note the following entry in the reference manual:

Elasticsearch opens a number of long-lived TCP connections between each pair of nodes in the cluster, and some of these connections may be idle for an extended period of time. Nonetheless, Elasticsearch requires these connections to remain open, and it can disrupt the operation of the cluster if any inter-node connections are closed by an external influence such as a firewall. It is important to configure your network to preserve long-lived idle connections between Elasticsearch nodes, for instance by leaving tcp.keep_alive enabled and ensuring that the keepalive interval is shorter than any timeout that might cause idle connections to be closed, or by setting transport.ping_schedule if keepalives cannot be configured.
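As a rough sketch (the exact values are illustrative and depend on your environment and firewall idle timeouts), that advice translates to something like the following on Linux, and/or an application-level ping in elasticsearch.yml:

# Linux kernel TCP keepalives: start probing idle connections well before any firewall idle timeout
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=6

# elasticsearch.yml
network.tcp.keep_alive: true      # default; enables SO_KEEPALIVE on transport connections
transport.ping_schedule: 5s       # periodic application-level pings if OS keepalives cannot be tuned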

An Elasticsearch ping happens at the application level over TCP and is therefore very different from an OS-level ping.
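If you want to check connectivity by hand, test the transport port itself rather than ICMP, for example (addresses taken from your logs):

# from the warm node, check the master's transport port is reachable
nc -vz 192.168.28.74 9300
# from the master, check the warm node's transport port
nc -vz 192.168.28.90 9300

Note that this only verifies that a new connection can be opened; it won't catch a firewall that silently drops long-idle connections, which is exactly why the keepalive settings above matter.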

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.