Basic HA/Failover Setup

Hi all, I am looking for a basic configuration for HA/failover. I have 2 clusters.
The first cluster runs a coordinating node and a master/data node as 2 separate processes on one machine, while in the 2nd cluster a single node acts as both master and data node. The configurations are below.

Master Node on Cluster 1 on Machine 1

cluster.name: remote
node.name: Remote-Master
node.master: false
node.data: false
node.ingest: false
path.data: D:\ELK-Runnable\Elasticsearch\data
path.logs: D:\ELK-Runnable\Elasticsearch\logs
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.unicast.hosts: ["0.0.0.0:9300", "0.0.0.0:9301", "192.168.116.138:9300"]

Data Node on Cluster 1 on Machine 1

cluster.name: remote
node.name: Remote-Data
node.master: true
node.data: true
path.data: D:\ELK-Runnable\Elasticsearch_Data\data
path.logs: D:\ELK-Runnable\Elasticsearch_Data\logs
network.host: 0.0.0.0
http.port: 9201
transport.port: 9301
discovery.zen.ping.unicast.hosts: ["0.0.0.0:9300", "0.0.0.0:9301", "192.168.116.138:9300"]

Master and Data on Cluster 2 on Machine 2

cluster.name: backup
path.data: C:\Program Files\Elasticsearch\data
path.logs: C:\Program Files\Elasticsearch\logs
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.zen.ping.unicast.hosts: ["192.168.119.1:9300", "192.168.119.1:9301", "0.0.0.0:9300"]

I uploaded 1 index, which completed successfully. I have enabled data replication, and the data was copied to the 2nd cluster. I then shut down the Master/Data node in cluster 1, hoping that the 2nd cluster would take over. I have added the 2nd cluster's node to the zen discovery section of the .yml file. However, it does not work. Any suggestions would be helpful. Thanks

A cluster cannot tolerate the loss of half or more of its master-eligible nodes. Therefore you need at least three master-eligible nodes in order to have a fault-tolerant cluster.
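To make the arithmetic concrete, here is a minimal Python sketch of the majority rule behind that statement: a master can only be elected while more than half of the master-eligible nodes are available. The function names are made up for illustration and are not part of Elasticsearch.

```python
def quorum(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can stop while a majority remains."""
    if master_eligible < 1:
        return 0
    return master_eligible - quorum(master_eligible)


# Columns: master-eligible nodes, quorum, failures tolerated
for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 5 3 2
```

With 2 master-eligible nodes the tolerance is still 0, because losing one of them is losing half. That is why 3 is the smallest number that survives a single failure.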

Thanks David. I am still a newbie to this, so please correct my understanding here.
I have the following:

1 (call it A1) - Coordinating & Master nodes in 1 process on Machine A in Cluster 1
1 (call it A2) - Data & Master nodes in 1 process on Machine A in Cluster 1

1 (call it B1) - Master and Data in 1 process on Machine B in Cluster 2. This has been configured with CCR.

Now, I stop A2, so A1 should link to B1. However, this does not happen.
So if we need to have 3 master-eligible nodes, this would mean the following:

  • Potentially having 2 Master nodes and a Data Node in Cluster 1. Total of 3 Nodes
  • Next, stop the Data Node to mimic a failover scenario
  • Cluster 1 will automatically fall back on the Node in Cluster 2

Is this assertion correct?

No, that's not what should happen. There's no safe way to move a node to a different cluster without risking data loss, so Elasticsearch won't do that. If you stop A2 then you have stopped half of the master-eligible nodes in cluster A, and the only safe way to proceed is to start A2 again.

Cool. Thanks David. I will figure something out...

Why not just set up a single cluster with 3 nodes that all hold data and are master-eligible? Such a cluster can tolerate one node failing and remains highly available.
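As a rough sketch of what that could look like, the Python snippet below renders an elasticsearch.yml for each of three nodes, reusing the setting names from the configurations posted earlier in this thread. The node names, the `render_node_config` helper, and the 192.0.2.x addresses are hypothetical placeholders, and it assumes one node per machine. Note that on Elasticsearch 7.0 and later, `discovery.seed_hosts` and `cluster.initial_master_nodes` take the place of the zen settings and `discovery.zen.minimum_master_nodes` is ignored, so adjust for your version.

```python
def render_node_config(cluster_name, node_name, seed_hosts):
    """Return elasticsearch.yml text for one master-eligible data node."""
    hosts = ", ".join(f'"{h}"' for h in seed_hosts)
    # Majority of the master-eligible nodes (2 when there are 3 of them).
    min_masters = len(seed_hosts) // 2 + 1
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        "node.master: true",
        "node.data: true",
        "network.host: 0.0.0.0",
        "http.port: 9200",
        "transport.port: 9300",
        f"discovery.zen.minimum_master_nodes: {min_masters}",
        f"discovery.zen.ping.unicast.hosts: [{hosts}]",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical addresses, one machine per node.
SEED_HOSTS = ["192.0.2.11:9300", "192.0.2.12:9300", "192.0.2.13:9300"]

# All three nodes share one cluster name and one list of hosts.
configs = {
    name: render_node_config("remote", name, SEED_HOSTS)
    for name in ("node-1", "node-2", "node-3")
}

print(configs["node-1"])
```

Every node gets the same cluster name and the same list of hosts; only `node.name` differs. The `path.data` and `path.logs` settings are left out here and would be set per machine as in the original configs.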