Problem communicating between nodes in cluster - send message failed, node gets removed from cluster

Hi,

We keep seeing this issue intermittently in our cluster. After the "send message failed" error appears, the node gets removed from the cluster and the cluster state goes red.

Immediately afterwards, the node gets added back into the cluster.
This happens several times an hour, though with no fixed frequency. Is there a way we can work around this issue?

[2018-11-14T11:59:16,974][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [es-node1-001] send message failed [channel: NettyTcpChannel{localAddress=/173.37.96.31:9300, remoteAddress=/173.36.39.60:50426}]
javax.net.ssl.SSLException: SSLEngine closed already
at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2018-11-14T12:00:08,117][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [es-node1-001] exception caught on transport layer [NettyTcpChannel{localAddress=/173.37.96.31:9300, remoteAddress=/173.36.39.60:50672}], closing connection
[2018-11-14T12:07:02,344][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [es-node1-001] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/173.36.39.60:50730}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-11-14T12:08:31,298][INFO ][o.e.c.s.ClusterApplierService] [es-node1-001] removed {{es-node2-001}{V4A7hjvtQcyFW7BlxG-j4w}{3GXiTiZ_QfCfFtprDhA2og}{es-node2-001}{173.36.39.60:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {es-node1-002}{44pA9ErPTb-y3zOylW8Z_Q}{byFkKyl5S1a_s3RiujXyRA}{es-node1-002}{173.37.96.32:9300}{xpack.installed=true} committed version [148]])
[2018-11-14T12:08:31,809][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [es-node1-001] failed to execute on node [V4A7hjvtQcyFW7BlxG-j4w]
org.elasticsearch.transport.NodeDisconnectedException: [es-node2-001][173.36.39.60:9300][cluster:monitor/nodes/stats[n]] disconnected
[2018-11-14T12:08:58,991][INFO ][o.e.c.s.ClusterApplierService] [es-node1-001] added {{es-node2-001}{V4A7hjvtQcyFW7BlxG-j4w}{3GXiTiZ_QfCfFtprDhA2og}{es-node2-001}{173.36.39.60:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {es-node1-002}{44pA9ErPTb-y3zOylW8Z_Q}{byFkKyl5S1a_s3RiujXyRA}{es-node1-002}{173.37.96.32:9300}{xpack.installed=true} committed version [152]])

Are those the logs from the node that disappears?

Hi,
That's right, these are from the node that keeps dropping out of the cluster.

We also saw something like this once:

[2018-11-14T23:04:48,723][WARN ][o.e.t.TransportService ] [es-node1-002] Received response for a request that has timed out, sent [82838ms] ago, timed out [52837ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-node2-002}{rMYDt_5ITuCTJbjUPluLJA}{Quk0hNopRJ-CZJyQz81AAg}{es-node2-002}{173.36.39.61:9300}{ml.machine_memory=67556810752, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [536]

@elco_comm1982, can you please let me know how many master-eligible nodes you have in your cluster, and what value of discovery.zen.minimum_master_nodes you have set in your elasticsearch.yml file?

Thanks.

Hi, we have 4 master-eligible nodes in the cluster. The cluster is spread across 2 data centers, with 2 nodes in each. The value of discovery.zen.minimum_master_nodes is 2. After trying a number of things, I found that the overridden values of thread_pool.write.queue_size and thread_pool.index.queue_size were causing the issue. Once I removed them, the cluster was stable.
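
For anyone else hitting this, the fix amounted to deleting the queue-size overrides from elasticsearch.yml. A minimal sketch of the lines we removed (the queue sizes shown are illustrative, not our actual values):

# Removed: overriding these defaults was destabilising the cluster
# thread_pool.write.queue_size: 2000
# thread_pool.index.queue_size: 2000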

That is not good. As per these guidelines, since you have 4 master-eligible nodes you should have minimum_master_nodes set to 3 in order to avoid split-brain scenarios.
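
For reference, the recommended value comes from the quorum formula in the Zen discovery docs, (master_eligible_nodes / 2) + 1, which for your cluster works out as:

discovery.zen.minimum_master_nodes: 3   # (4 / 2) + 1 = 3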

Are these datacentres spread far apart?


I think that's a typo: minimum_master_nodes must be at least 3 if there are 4 master-eligible nodes. With it left at 2, each data centre's 2 master-eligible nodes can form a quorum on their own, so it's only a matter of time before a brief connectivity loss between your two data centres produces two independent masters and leads to data loss.

Thanks for spotting this. Have corrected it.


Thanks, will do this.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.