Problem communicating within nodes in cluster - send message failed, node gets removed from cluster


#1

Hi,

We keep seeing this issue intermittently in our cluster. After the "send message failed" error appears, the node gets removed from the cluster and the cluster state goes red.

Immediately afterwards, the node gets added back into the cluster.
This happens several times an hour, with no fixed frequency. Is there a way we can work around this issue?

[2018-11-14T11:59:16,974][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [es-node1-001] send message failed [channel: NettyTcpChannel{localAddress=/173.37.96.31:9300, remoteAddress=/173.36.39.60:50426}]
javax.net.ssl.SSLException: SSLEngine closed already
at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2018-11-14T11:59:16,974][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [es-node1-001] send message failed [channel: NettyTcpChannel{localAddress=/173.37.96.31:9300, remoteAddress=/173.36.39.60:50426}]
javax.net.ssl.SSLException: SSLEngine closed already
at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2018-11-14T12:00:08,117][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [es-node1-001] exception caught on transport layer [NettyTcpChannel{localAddress=/173.37.96.31:9300, remoteAddress=/173.36.39.60:50672}], closing connection
[2018-11-14T12:07:02,344][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [es-node1-001] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/173.36.39.60:50730}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-11-14T12:08:31,298][INFO ][o.e.c.s.ClusterApplierService] [es-node1-001] removed {{es-node2-001}{V4A7hjvtQcyFW7BlxG-j4w}{3GXiTiZ_QfCfFtprDhA2og}{es-node2-001}{173.36.39.60:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {es-node1-002}{44pA9ErPTb-y3zOylW8Z_Q}{byFkKyl5S1a_s3RiujXyRA}{es-node1-002}{173.37.96.32:9300}{xpack.installed=true} committed version [148]])
[2018-11-14T12:08:31,809][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [es-node1-001] failed to execute on node [V4A7hjvtQcyFW7BlxG-j4w]
org.elasticsearch.transport.NodeDisconnectedException: [es-node2-001][173.36.39.60:9300][cluster:monitor/nodes/stats[n]] disconnected
[2018-11-14T12:08:58,991][INFO ][o.e.c.s.ClusterApplierService] [es-node1-001] added {{es-node2-001}{V4A7hjvtQcyFW7BlxG-j4w}{3GXiTiZ_QfCfFtprDhA2og}{es-node2-001}{173.36.39.60:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {es-node1-002}{44pA9ErPTb-y3zOylW8Z_Q}{byFkKyl5S1a_s3RiujXyRA}{es-node1-002}{173.37.96.32:9300}{xpack.installed=true} committed version [152]])


(Mark Walkom) #2

Are those the logs from the node that disappears?


#3

Hi,
That's right, these are from the node that keeps dropping out of the cluster.


#4

I also saw something like this once:

[2018-11-14T23:04:48,723][WARN ][o.e.t.TransportService ] [es-node1-002] Received response for a request that has timed out, sent [82838ms] ago, timed out [52837ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-node2-002}{rMYDt_5ITuCTJbjUPluLJA}{Quk0hNopRJ-CZJyQz81AAg}{es-node2-002}{173.36.39.61:9300}{ml.machine_memory=67556810752, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [536]


(Tek Chand) #5

@elco_comm1982, can you please let me know how many master nodes you have in your cluster, and what value of discovery.zen.minimum_master_nodes you have set in your elasticsearch.yml file?

Thanks.


#6

Hi, we have 4 master-eligible nodes in the cluster, spread across 2 data centers with 2 nodes in each. The value of discovery.zen.minimum_master_nodes is 2. After trying a number of things, I found that the overridden values of thread_pool.write.queue_size and thread_pool.index.queue_size were causing the issue. Once I removed them, the cluster was stable.
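For reference, the kind of override that was causing trouble looked roughly like this in elasticsearch.yml (the values below are illustrative, not our actual production settings; removing the overrides and falling back to the defaults is what stabilised the cluster):

```yaml
# Illustrative thread pool overrides that were in place and later removed.
# Oversized queues can mask back-pressure and tie up memory, contributing
# to node instability under load.
thread_pool.write.queue_size: 2000
thread_pool.index.queue_size: 2000
```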


(Christian Dahlqvist) #7

That is not good. As per these guidelines, since you have 4 master-eligible nodes, you should have minimum_master_nodes set to 3 in order to avoid split-brain scenarios.


(Mark Walkom) #8

Are these datacentres spread far apart?


(David Turner) #9

I think that's a typo, minimum_master_nodes must be at least 3 if there are 4 master-eligible nodes. If you leave it at 2 then it's only a matter of time before a brief connectivity loss between your two data centres leads to data loss.
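Concretely, the quorum is a strict majority of master-eligible nodes: floor(4 / 2) + 1 = 3. In elasticsearch.yml on each master-eligible node that would be (a sketch; applies to the Zen discovery used in 6.x):

```yaml
# Quorum for 4 master-eligible nodes: floor(4 / 2) + 1 = 3.
# With only 2, each side of a split between the two data centers
# can elect its own master (split brain), risking data loss.
discovery.zen.minimum_master_nodes: 3
```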


(Christian Dahlqvist) #10

Thanks for spotting this. Have corrected it.


#12

Thanks, will do this.