Elasticsearch cluster resiliency and availability

We are currently running ES on a 4 node cluster where 2 nodes are in DC 1 & 2 in DC 2. We have a third node in DC 1 with master-voting-only role. The n/w latency b/w DC 1 & DC 2 is negligible.
DC 1 - 2 nodes (all roles) + 1 node master-voting-only
DC 2 - 2 nodes (all roles)

We are looking to expand the cluster and add 3 nodes in each DC - 5 in each DC with a total of 10 and remove the master-voting-only node from DC 1.
To achieve better cluster resiliency and availability, we thought of 2 approaches.

Approach 1:
Setup 2 separate clusters in each DC (Cluster-1 in DC 1 & Cluster-2 in DC 2) and configure CCR b/w Cluster-1 & Cluster-2. Setup indexes in Cluster-2 to replicate from Cluster-1.
But this approach has several operational issues (setting up follower indexes for all existing indexes) as well as the main issue that we cannot operate solely out of Cluster-2 due to the fact that documents cannot be indexed/updated directly in the follower indexes without a complicated manual process to change them from follower to leader.

Approach 2:
Setup a single cluster with all 10 nodes and configure a master ensemble of 5 nodes with 3 in DC 1 and 2 in DC 2.
DC 1 - 3 nodes (all roles) + 2 nodes (all roles except master)
DC 2 - 2 nodes (all roles) + 3 nodes (all roles except master)

We plan to use the cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values to ensure that a primary and replica shard are not located on nodes in the same DC.
This solves the issue of data availability in both DCs to certain extent.

In a scenario where cluster.routing.allocation.awareness.force.zone.values is set and only the nodes in DC 1 are running (DC 2 nodes are down), if one of the nodes in the DC 1 goes down then it will result in a data loss as there are no replicas to load the shards on another node in DC 1. Is this understanding correct?

There is another existing problem that needs to be solved - ensure the cluster is functioning if all nodes in a DC go down.
With the planned master nodes configuration, the cluster will work fine if all nodes in DC 2 go down within a short time interval as the 3 master nodes in DC 1 can form a quorum.
But if all nodes in DC 1 go down before the cluster has a chance to rebalance, then the cluster stops functioning.
How can the cluster be made functional again in this situation?
If we restart a couple of non-master nodes in DC 2 with master role, will it form a quorum with the existing 2 master nodes in DC 2? I don't think this will work as a quorum is needed with existing master nodes to change the voting configuration. Am I correct?
Or can we remove the master nodes from DC 1 from the voting configuration explicitly using the voting configuration API?

The docs seem to suggest that recovery from this situation is impossible. Is it really the case?

You cannot configure a two-zone cluster so that it can tolerate the loss of either zone because this is theoretically impossible.


Yes, it's essentially the two generals problem which is provably unsolvable.

Hi David,
I understand the problem and am not expecting the ES cluster to resolve by itself. Isn't there a way to force the cluster to remove the unavailable master nodes from its voting configuration manually?
This stems from the fact that the admin is aware that there is no n/w issue and that the unavailable master nodes aren't running anymore.


Is it not possible to remove the unavailable master nodes manually using the voting config exclusions API?


The voting exclusions API needs an elected master so no that's not going to work.

There's no safe way to solve this. The cluster state is stored on a majority subset of the current masters. If there's a majority still available then the cluster will automatically elect a master so there's no problem. If there is no majority still available then there's no way to find the latest cluster state, since it could be held only on the masters that aren't running any more.

Surely there must be some way to get the cluster working again.
Can the remaining nodes be restarted as a new cluster with a different "cluster.name" property? Would it load up all the existing indexes?


If there were a safe way to recover from this situation then Elasticsearch would implement it. But as the docs say: what you are asking for is not even theoretically possible.

You need a tiebreaker node in an independent third zone.

See these docs too for instance:

If the logs or the health report indicate that Elasticsearch can’t discover enough nodes to form a quorum, you must address the reasons preventing Elasticsearch from discovering the missing nodes. The missing nodes are needed to reconstruct the cluster metadata. Without the cluster metadata, the data in your cluster is meaningless. The cluster metadata is stored on a subset of the master-eligible nodes in the cluster. If a quorum can’t be discovered, the missing nodes were the ones holding the cluster metadata. [...] If you can’t start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot.

Hi David,
I read the bit about the the tie-breaker node in a third independent zone. Most production systems have 2 env - a primary and a secondary. The secondary is supposed to be the backup. I have not seen a case where a 3rd env is available which can be used to run the tie-breaker node.

Looks like we have 2 options:
Option 1:
Run 2 separate clusters with secondary following the primary. Issue here is the manual effort in setting up the follower indexes and failing over to secondary when primary fails.
Is there an easy way to switch the follower indexes into leaders during failover? How do we revert back to the original setup once the primary cluster is running again? I searched the docs/google and did not find any documentation on this.

Option 2:
Run a single cluster across both env. Use periodic snapshots to restore the data on a new secondary cluster in case of any sudden failure of all nodes in primary env. This would result in loss of data as the snapshot will not be current . It would be tedious to shut down all the running nodes in secondary env and restart them to form a new cluster before restoring the snapshot.

Both options seem to involve a lot of manual effort. :slightly_frowning_face:
Is there a third option?

One more question:
We are planning to run a Kibana instance to access the cluster. Should Kibana be running on a different host from the cluster?


Bidirectional replication is another possibility - in many cases this neatly avoids the need to take any action when a whole zone is down since if the zone is down then there's nothing writing to that zone's indices.

But personally I'd recommend looking for somewhere to host a tiebreaker node, it's a lot simpler. Most HA production systems we encounter have 3 or more zones, precisely because with 2 zones the two-generals-problem makes HA setups infeasible. That's why the vast majority of all public cloud regions have at least 3 zones too. In fact we know of some folks who use a nearby public cloud to host the tiebreaker if they're running in a colo environment that can't offer a fully HA setup, maybe that's something you could consider too.

I looked at bidirectional replication as well. But doc says that updates to documents are not allowed.

This configuration is particularly useful for index-only workloads, where no updates to document values occur. In this configuration, documents indexed by Elasticsearch are immutable.

We have several indexes where documents are updated after the initial ingestion.

I will check if we can run a dedicated voting-only-master in a different location. The issue here would be the n/w latency but since this is a voting-only-master it may not be a serious issue.

Otherwise, we may have to setup uni-directional replication b/w 2 clusters and then failover to the secondary cluster if primary goes down.

I believe changing follower indexes to normal ones should be possible from Kibana, correct?

Regarding my other question on Kibana, should it run on a separate host outside the cluster or can it be co-located on any of the nodes in the ES cluster?


It doesn't, it just says it's "particularly useful" for index-only workloads. Updates should work fine too. TBH I've no idea what that sentence is trying to say.

Not sure it matters in most cases, although in general it's best to dedicate resources to ES if only to make performance troubleshooting easier in future.

Hi David,
We don't have a way to test if updates will flow in the case of bi-directional replication currently. We will review which one is the best option for us to ensure cluster availability & resiliency.

Nevertheless, thank you very much for your advice.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.