We have an Elasticsearch cluster deployed in Rackspace. Each node runs on its own server (Windows Server 2012 R2).
We have three nodes with the following elasticsearch.yml:
action.disable_delete_all_indices: true
cluster.name: ClusterUK
network.publish_host: "172.24.32.10"
discovery.zen.ping.timeout: "30s"
discovery.zen.ping_timeout: "30s"
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.24.32.10", "172.24.32.5", "172.24.32.8"]
indices.fielddata.cache.size: 25%
indices.cluster.send_refresh_mapping: false
node.name: "ClusterUK Node 1"
node.master: true
node.data: true
bootstrap.mlockall: true
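As a sanity check on `discovery.zen.minimum_master_nodes`, here is a minimal sketch of the usual quorum rule (master-eligible nodes / 2, rounded down, plus 1). The function name `minimum_master_nodes` is only an illustration, not part of any Elasticsearch API. It assumes all three data nodes are master-eligible and that the client node (`master=false`) is not counted.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum size for a given number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    # Integer division rounds down, so 3 nodes -> 3 // 2 + 1 = 2.
    return master_eligible // 2 + 1


# Three master-eligible nodes, as in the config above.
quorum = minimum_master_nodes(3)
```

For our three nodes this gives 2, which matches the value in the config, so the setting itself looks right.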
And these are the logs it produces:
[2015-11-11 07:39:37,615][INFO ][http ] [ClusterUK Node 1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.24.32.10:9200]}
[2015-11-11 07:39:37,615][INFO ][node ] [ClusterUK Node 1] started
[2015-11-11 07:39:38,896][INFO ][discovery.zen ] [ClusterUK Node 1] failed to send join request to master [[ClusterUK Node 1][Ar_pY4NNRBWwTbv9fV226w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}], reason [RemoteTransportException[[ClusterUK Node 1][inet[/172.24.32.10:9300]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}] not master for join request from [[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[/172.24.32.10:9300]]{master=true}]]; ], tried [3] times
[2015-11-11 07:40:09,974][INFO ][cluster.service ] [ClusterUK Node 1] detected_master [ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}, added {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},[ClusterUK Client Node STG1][Uxmn2i1iSpuxlp3IgjNNdQ][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}])
[2015-11-11 07:42:06,756][INFO ][cluster.service ] [ClusterUK Node 1] added {[ClusterUK Node 2][UKA81JAURsquFqvH7xiAFg][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}])
[2015-11-11 08:00:37,378][INFO ][discovery.zen ] [ClusterUK Node 1] master_left [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}], reason [transport disconnected]
[2015-11-11 08:00:37,380][WARN ][discovery.zen ] [ClusterUK Node 1] master left (reason = transport disconnected), current nodes: {[ClusterUK Node 2][UKA81JAURsquFqvH7xiAFg][elasticuk2][inet[/172.24.32.5:9300]]{master=true},[ClusterUK Node 1][z2poU5hqQT-VmBKJifD0-w][elasticuk1][inet[elasticuk1/172.24.32.10:9300]]{master=true},[ClusterUK Client Node STG1][Uxmn2i1iSpuxlp3IgjNNdQ][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},}
[2015-11-11 08:00:37,380][INFO ][cluster.service ] [ClusterUK Node 1] removed {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-master_failed ([ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true})
[2015-11-11 08:00:37,985][ERROR][marvel.agent.exporter ] [ClusterUK Node 1] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [:)
��error�ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]��status$��]
[2015-11-11 08:00:47,996][ERROR][marvel.agent.exporter ] [ClusterUK Node 1] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [:)
��error�ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]��status$��]
[2015-11-11 08:01:07,407][INFO ][cluster.service ] [ClusterUK Node 1] detected_master [ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}, added {[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 3][m5ns1sKHTDSSdbBMWNsqwA][elasticuk3][inet[/172.24.32.8:9300]]{master=true}])
It seems that the master node disconnects briefly (about 30 seconds in the log above, from 08:00:37 to 08:01:07) and then rejoins the cluster. This causes data loss if bulk inserts are in progress and may lead to split-brain. Does anyone know the root cause and how to fix it?
Elasticsearch version: 1.7.3
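To put a number on each outage, here is a minimal sketch that pairs every `master_left` line with the next `detected_master` line and reports the gap. It assumes the log line format shown above (a leading `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp). The helper name `master_outages` is hypothetical, not something Elasticsearch provides.

```python
import re
from datetime import datetime

# Matches the leading "[2015-11-11 08:00:37,378]" timestamp of a log line.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def master_outages(lines):
    """Yield (left_at, back_at, seconds) for each master_left/detected_master pair."""
    left_at = None
    for line in lines:
        m = TS.match(line)
        if not m:
            # Continuation lines (e.g. the Marvel response body) have no timestamp.
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if "master_left" in line:
            left_at = ts
        elif "detected_master" in line and left_at is not None:
            yield left_at, ts, (ts - left_at).total_seconds()
            left_at = None
```

Passing an open log file (or any iterable of lines) to `master_outages` yields one tuple per outage. For the excerpt above it reports a single gap of roughly 30 seconds.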