[SOLVED] Frequent node disconnects on Rackspace environment

We have an Elasticsearch cluster deployed in Rackspace. Each machine has its own server (Windows Server 2012 R2).

We have three nodes, each with the following elasticsearch.yml:

action.disable_delete_all_indices: true

cluster.name: ClusterUK

network.publish_host: "172.24.32.10"

discovery.zen.ping.timeout: "30s"
discovery.zen.ping_timeout: "30s"
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.24.32.10", "172.24.32.5", "172.24.32.8"]

indices.fielddata.cache.size: 25%
indices.cluster.send_refresh_mapping: false

node.name: "ClusterUK Node 1" 
node.master: true
node.data: true

bootstrap.mlockall: true
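(With three master-eligible nodes, discovery.zen.minimum_master_nodes: 2 matches the usual quorum of (3 / 2) + 1 = 2, so split-brain protection should in principle be in place.)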

And these are the logs it produces:

[2015-11-11 07:39:37,615][INFO ][http                     ] [ClusterUK Node 1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.24.32.10:9200]}
[2015-11-11 07:39:37,615][INFO ][node                     ] [ClusterUK Node 1] started
[2015-11-11 07:39:38,896][INFO ][discovery.zen            ] [ClusterUK Node 1] failed to send join request to master [[ClusterUK Node 1][Ar_pY4NNRBWwTbv9fV226w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}], reason [RemoteTransportException[[ClusterUK Node 1][inet[/172.24.32.10:9300]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}] not master for join request from [[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}]]; ], tried [3] times
[2015-11-11 07:40:09,974][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}, added {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},[ClusterUK Client Node STG1][Uxmn2i1iSpuxlp3IgjNNdQ][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}])
[2015-11-11 07:42:06,756][INFO ][cluster.service          ] [ClusterUK Node 1] added {[ClusterUK Node 2][UKA81JAURsquFqvH7xiAFg][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}])
[2015-11-11 08:00:37,378][INFO ][discovery.zen            ] [ClusterUK Node 1] master_left [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}], reason [transport disconnected]
[2015-11-11 08:00:37,380][WARN ][discovery.zen            ] [ClusterUK Node 1] master left (reason = transport disconnected), current nodes: {[ClusterUK Node 2][UKA81JAURsquFqvH7xiAFg][elasticuk2][inet[/172.24.32.5:9300]]{master=true},[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[elasticuk1/172.24.32.10:9300]]{master=true},[ClusterUK Client Node STG1][Uxmn2i1iSpuxlp3IgjNNdQ][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},}
[2015-11-11 08:00:37,380][INFO ][cluster.service          ] [ClusterUK Node 1] removed {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-master_failed ([ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true})
[2015-11-11 08:00:37,985][ERROR][marvel.agent.exporter    ] [ClusterUK Node 1] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [error=ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];], status=…]
[2015-11-11 08:00:47,996][ERROR][marvel.agent.exporter    ] [ClusterUK Node 1] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [error=ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];], status=…]
[2015-11-11 08:01:07,407][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}, added {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}]) 

It seems that the master node disconnects for a second and then rejoins the cluster. This causes data loss when bulk inserts are in flight and could lead to split-brain. Does anyone know what the root cause is and how it can be fixed?

Version: 1.7.3
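For reference, we're watching how long these master-less windows last by polling cluster health from one of the nodes. A rough PowerShell sketch (host and port taken from the logs above; adjust as needed):

```powershell
# Poll cluster health every 5 seconds and print the status; an exception / 503 here
# lines up with the "no master" ClusterBlockException the Marvel exporter logs above.
while ($true) {
    try {
        $health = Invoke-RestMethod -Uri "http://172.24.32.10:9200/_cluster/health" -ErrorAction Stop
        Write-Host ("{0}  status={1}  nodes={2}" -f (Get-Date -Format o), $health.status, $health.number_of_nodes)
    } catch {
        Write-Host ("{0}  no response / no master: {1}" -f (Get-Date -Format o), $_.Exception.Message)
    }
    Start-Sleep -Seconds 5
}
```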

Firewall?
Are you monitoring the network?

The firewall is turned off. Network monitoring is turned off too.

Looking into the transport issue, I found that it uses TCP and that it might help to disable TCP offloading in the adapter settings. Article here: http://www.rackspace.com/knowledge_center/article/disabling-tcp-offloading-in-windows-server-2012
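Before changing anything I'm checking the current offload state from an elevated PowerShell prompt; a rough, read-only sketch (exact property names and values vary by NIC driver):

```powershell
# Show the adapters and their current checksum / large-send offload state (read-only).
Get-NetAdapter | Format-Table Name, InterfaceDescription, Status
Get-NetAdapterChecksumOffload | Format-Table Name, IpIPv4Enabled, TcpIPv4Enabled, TcpIPv6Enabled
Get-NetAdapterLso | Format-Table Name, IPv4Enabled, IPv6Enabled

# "Large Receive Offload" is usually exposed as a driver-specific advanced property.
Get-NetAdapterAdvancedProperty | Where-Object { $_.DisplayName -like "*Offload*" } |
    Format-Table Name, DisplayName, DisplayValue
```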

Trying it now. Will update.

It was TCP Offloading.

A TCP offload engine is a function used in network interface cards (NICs) to offload processing of the entire TCP/IP stack to the network controller. By moving some or all of the processing to dedicated hardware, a TCP offload engine frees the system's main CPU for other tasks. However, TCP offloading has been known to cause issues, and disabling it can help avoid them.

### Disable TCP Offloading

  1. On the Windows server, open the Control Panel and select Network Settings > Change Adapter Settings.

Screenshot

  2. Right-click on each of the adapters (private and public), select Configure from the Networking menu, and then click the Advanced tab. The TCP offload settings are listed for the Citrix adapter.

Screenshot

  3. Disable each of the following TCP offload options, and then click OK (a PowerShell sketch of the same change follows this list):
  • IPv4 Checksum Offload
  • Large Receive Offload
  • Large Send Offload
  • TCP Checksum Offload
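If you'd rather script it than click through each adapter dialog, roughly the same change can be made from an elevated PowerShell prompt on Server 2012 R2. This is only a sketch: the adapter name and the exact Large Receive Offload display name below are placeholders and depend on the NIC driver.

```powershell
# IPv4 Checksum Offload and TCP Checksum Offload, on all adapters.
Disable-NetAdapterChecksumOffload -Name "*" -IpIPv4 -TcpIPv4

# Large Send Offload (IPv4 and IPv6), on all adapters.
Disable-NetAdapterLso -Name "*" -IPv4 -IPv6

# Large Receive Offload is a driver-specific advanced property. List the exact
# display names first, then disable the matching entry; "Ethernet" and the
# display name below are placeholders that will differ per driver.
Get-NetAdapterAdvancedProperty | Where-Object { $_.DisplayName -like "*Receive Offload*" }
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Large Receive Offload (IPv4)" -DisplayValue "Disabled"

# Changing these settings briefly resets the adapter; verify afterwards with the
# corresponding Get- cmdlets (Get-NetAdapterChecksumOffload, Get-NetAdapterLso).
```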

This solved my issue.
