[SOLVED] Frequent node disconnects on Rackspace environment

We have an Elasticsearch cluster deployed in Rackspace. Each machine has its own server (Windows Server 2012 R2).

We have three nodes, each with the following elasticsearch.yml:

action.disable_delete_all_indices: true

cluster.name: ClusterUK

network.publish_host: "172.24.32.10"

discovery.zen.ping.timeout: "30s"
discovery.zen.ping_timeout: "30s"
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.24.32.10", "172.24.32.5", "172.24.32.8"]

indices.fielddata.cache.size: 25%
indices.cluster.send_refresh_mapping: false

node.name: "ClusterUK Node 1" 
node.master: true
node.data: true

bootstrap.mlockall: true
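(With three master-eligible nodes, discovery.zen.minimum_master_nodes: 2 matches the usual quorum of (3 / 2) + 1 = 2, so split-brain protection should in principle be in place.)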

And these are the logs it produces:

[2015-11-11 07:39:37,615][INFO ][http                     ] [ClusterUK Node 1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.24.32.10:9200]}
[2015-11-11 07:39:37,615][INFO ][node                     ] [ClusterUK Node 1] started
[2015-11-11 07:39:38,896][INFO ][discovery.zen            ] [ClusterUK Node 1] failed to send join request to master [[ClusterUK Node 1][Ar_pY4NNRBWwTbv9fV226w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}], reason [RemoteTransportException[[ClusterUK Node 1][inet[/172.24.32.10:9300]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}] not master for join request from [[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}]]; ], tried [3] times
[2015-11-11 07:40:09,974][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}, added {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},[ClusterUK Client Node STG1][Uxmn2i1iSpuxlp3IgjNNdQ][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}])
[2015-11-11 07:42:06,756][INFO ][cluster.service          ] [ClusterUK Node 1] added {[ClusterUK Node 2][UKA81JAURsquFqvH7xiAFg][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}])
[2015-11-11 08:00:37,378][INFO ][discovery.zen            ] [ClusterUK Node 1] master_left [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}], reason [transport disconnected]
[2015-11-11 08:00:37,380][WARN ][discovery.zen            ] [ClusterUK Node 1] master left (reason = transport disconnected), current nodes: {[ClusterUK Node 2][UKA81JAURsquFqvH7xiAFg][elasticuk2][inet[/172.24.32.5:9300]]{master=true},[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[elasticuk1/172.24.32.10:9300]]{master=true},[ClusterUK Client Node STG1][Uxmn2i1iSpuxlp3IgjNNdQ][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},}
[2015-11-11 08:00:37,380][INFO ][cluster.service          ] [ClusterUK Node 1] removed {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-master_failed ([ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true})
[2015-11-11 08:00:37,985][ERROR][marvel.agent.exporter    ] [ClusterUK Node 1] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [error=ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];], status=…]
[2015-11-11 08:00:47,996][ERROR][marvel.agent.exporter    ] [ClusterUK Node 1] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [error=ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];], status=…]
[2015-11-11 08:01:07,407][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}, added {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}]) 

It seems that the master node disconnects for a second and then rejoins the cluster. This causes data loss when bulk inserts are in flight and could lead to split-brain. Does anyone know what the root cause is and how it can be fixed?

Version: 1.7.3
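For reference, we're watching how long these master-less windows last by polling cluster health from one of the nodes. A rough PowerShell sketch (host and port taken from the logs above; adjust as needed):

```powershell
# Poll cluster health every 5 seconds and print the status; an exception / 503 here
# lines up with the "no master" ClusterBlockException the Marvel exporter logs above.
while ($true) {
    try {
        $health = Invoke-RestMethod -Uri "http://172.24.32.10:9200/_cluster/health" -ErrorAction Stop
        Write-Host ("{0}  status={1}  nodes={2}" -f (Get-Date -Format o), $health.status, $health.number_of_nodes)
    } catch {
        Write-Host ("{0}  no response / no master: {1}" -f (Get-Date -Format o), $_.Exception.Message)
    }
    Start-Sleep -Seconds 5
}
```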

Firewall?
Are you monitoring the network?

The firewall is turned off. Network monitoring is turned off too.

Looking into the transport issue, I found that it uses TCP and that it might help to disable TCP offloading in the adapter settings. Article here: http://www.rackspace.com/knowledge_center/article/disabling-tcp-offloading-in-windows-server-2012
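Before changing anything I'm checking the current offload state from an elevated PowerShell prompt; a rough, read-only sketch (exact property names and values vary by NIC driver):

```powershell
# Show the adapters and their current checksum / large-send offload state (read-only).
Get-NetAdapter | Format-Table Name, InterfaceDescription, Status
Get-NetAdapterChecksumOffload | Format-Table Name, IpIPv4Enabled, TcpIPv4Enabled, TcpIPv6Enabled
Get-NetAdapterLso | Format-Table Name, IPv4Enabled, IPv6Enabled

# "Large Receive Offload" is usually exposed as a driver-specific advanced property.
Get-NetAdapterAdvancedProperty | Where-Object { $_.DisplayName -like "*Offload*" } |
    Format-Table Name, DisplayName, DisplayValue
```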

Trying it now. Will update.

It was TCP Offloading.

A TCP offload engine is a function used in network interface cards (NICs) to offload processing of the entire TCP/IP stack to the network controller. By moving some or all of the processing to dedicated hardware, a TCP offload engine frees the system's main CPU for other tasks. However, TCP offloading has been known to cause issues, and disabling it can help avoid them.

### Disable TCP Offloading

  1. On the Windows server, open the Control Panel and select Network Settings > Change Adapter Settings.

Screenshot

  2. Right-click on each of the adapters (private and public), select Configure from the Networking menu, and then click the Advanced tab. The TCP offload settings are listed for the Citrix adapter.

Screenshot

  3. Disable each of the following TCP offload options, and then click OK (a PowerShell sketch of the same change follows this list):
  • IPv4 Checksum Offload
  • Large Receive Offload
  • Large Send Offload
  • TCP Checksum Offload
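If you'd rather script it than click through each adapter dialog, roughly the same change can be made from an elevated PowerShell prompt on Server 2012 R2. This is only a sketch: the adapter name and the exact Large Receive Offload display name below are placeholders and depend on the NIC driver.

```powershell
# IPv4 Checksum Offload and TCP Checksum Offload, on all adapters.
Disable-NetAdapterChecksumOffload -Name "*" -IpIPv4 -TcpIPv4

# Large Send Offload (IPv4 and IPv6), on all adapters.
Disable-NetAdapterLso -Name "*" -IPv4 -IPv6

# Large Receive Offload is a driver-specific advanced property. List the exact
# display names first, then disable the matching entry; "Ethernet" and the
# display name below are placeholders that will differ per driver.
Get-NetAdapterAdvancedProperty | Where-Object { $_.DisplayName -like "*Receive Offload*" }
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Large Receive Offload (IPv4)" -DisplayValue "Disabled"

# Changing these settings briefly resets the adapter; verify afterwards with the
# corresponding Get- cmdlets (Get-NetAdapterChecksumOffload, Get-NetAdapterLso).
```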

This solved my issue.
