What's the downside of making all data nodes master eligible

I understand it's recommended to configure 3 dedicated master nodes. I'm curious: what is the downside, in practice, of making all N (N > 3) nodes both data nodes and master eligible? One downside I can think of is that the elected master might be a heavily loaded data node, which can become unresponsive.

Would the master election process take much longer when the current master is down? Do Zen and the ESv7 discovery module behave the same in this sense?

If any cluster state update needs to be replicated across all (or minimum_master_nodes) master-eligible nodes before being sent to non-master-eligible nodes, does this configuration make cluster state updates much slower in practice for bigger clusters, for example N == 12, 36, or 50+? Again, is the ESv7 discovery module going to perform better than Zen?

If this configuration becomes really bad when N > x, what is the x here?

All nodes, including data and client nodes, hold the cluster state. The active master is the only one that can update it, though.

I don't know enough about the rest of your question to answer with authority, but I am sure someone else will be able to help :slight_smile:

Thanks. My bad. I meant "If any cluster state update needs to be replicated across all (or minimum_master_nodes) master-eligible nodes before being sent to non-master-eligible nodes". I've already updated the question.

One I can think of is that the elected master might be a loaded data node, which can become unresponsive

Yes, that is one of the main reasons. A cluster cannot function without a stable and responsive master, so (especially in a larger deployment) you would typically want to avoid putting that responsibility on a node that is also executing queries and handling ingest load.
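For reference, a dedicated master node can be configured with a few settings in elasticsearch.yml. This is a minimal sketch using the role flags from the 7.0-7.8 series; later versions (7.9+) replace these with a single `node.roles` list:

```yaml
# elasticsearch.yml for a dedicated master-eligible node (7.0-7.8 style)
node.master: true   # may participate in master elections and hold the elected-master role
node.data: false    # holds no shard data, so queries and ingest never land here
node.ingest: false  # runs no ingest pipelines
```

With three nodes configured like this and the rest of the cluster set to `node.master: false`, the election only ever involves the three dedicated masters.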

If any cluster state update needs to be replicated across all (or minimum_master_nodes) master-eligible nodes before being sent to non-master-eligible nodes, does this configuration make cluster state updates much slower in practice for bigger clusters, for example N == 12, 36, or 50+? Again, is the ESv7 discovery module going to perform better than Zen?

Cluster state publishing in Zen and ESv7 works in the same way and is documented in more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/7.0/cluster-state-publishing.html
In particular, the cluster state is sent to all the nodes at once (master-eligible nodes are prioritized, but the master does not wait for their responses before sending the state to the other nodes). Only once a majority of the master-eligible nodes have accepted the state is it actually committed and applied on all nodes. This means that from a publication perspective, the number of master-eligible nodes does not matter that much, except that the state may be committed slightly faster.
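The commit rule above can be sketched in a few lines. This is illustrative only, not the Elasticsearch implementation: the state goes out to every node, but only acknowledgements from master-eligible nodes count towards the quorum that commits it.

```python
# Sketch of quorum-based commit: a cluster state update is broadcast to all
# nodes, but becomes committed only once a majority of the master-eligible
# nodes have accepted it. Acks from data-only nodes do not count.

def quorum(master_eligible_count: int) -> int:
    """Smallest number of acceptances that forms a majority."""
    return master_eligible_count // 2 + 1

def is_committed(acks: set, master_eligible: set) -> bool:
    """True once a majority of master-eligible nodes have acked the state."""
    return len(acks & master_eligible) >= quorum(len(master_eligible))

masters = {"m1", "m2", "m3"}
print(is_committed({"m1", "data-7"}, masters))        # False: only one master ack
print(is_committed({"m1", "m2", "data-7"}, masters))  # True: two of three masters
```

Note that the data node's ack is simply ignored for the commit decision, which is why adding many data nodes does not slow down publication.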

From a master election perspective, however, avoiding a large number of master-eligible nodes is advisable. Master elections work differently in Zen than in ESv7 discovery.

For Zen discovery, elections are based on a 3-second ping phase in which nodes learn which other nodes are around them, and then actively vote for a node to become master based on a deterministic function of the nodes they found during pinging. Note that this assumes they all share the same knowledge of which nodes are out there, which is what the 3-second pinging phase is meant to establish. More details here: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/modules-discovery-zen.html#master-election
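A toy version of such a deterministic choice might look like the following. This is a hypothetical simplification, not the actual Zen code: the point is only that every node applies the same function, so identical knowledge of the cluster yields identical votes.

```python
# Toy deterministic election choice (simplified, hypothetical): each node
# applies this same function to the set of nodes it discovered during the
# ping phase, so nodes with identical knowledge vote for the same candidate.

def elect(discovered: list) -> str:
    """Return the id of the preferred master, or None if no candidate exists."""
    candidates = [n for n in discovered if n["master_eligible"]]
    if not candidates:
        return None
    # Simplified tie-break by node id; the real function also considers
    # other factors when ordering candidates.
    return min(candidates, key=lambda n: n["id"])["id"]

nodes = [
    {"id": "node-b", "master_eligible": True},
    {"id": "node-a", "master_eligible": True},
    {"id": "node-c", "master_eligible": False},
]
print(elect(nodes))  # node-a
```

If two nodes discovered different sets during pinging, they could vote for different candidates, which is why the shared-knowledge assumption matters.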

In ESv7, elections are Raft-style and based on randomized timeouts (https://www.elastic.co/guide/en/elasticsearch/reference/7.0/modules-discovery-quorums.html#_master_elections). The advantage of this is that elections can typically be much quicker than the 3 seconds of Zen discovery. However, they bring the risk of election clashes when many nodes participate and start concurrent elections. For this, ESv7 automatically increases the randomized timeouts as elections fail, so that one will eventually succeed. In some experimental setups this has been shown to work with 50+ nodes, but we do not recommend running a cluster that way, as it will lead to slower master elections.
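The back-off idea can be illustrated with a small sketch. The parameters here are made up for illustration, not the actual Elasticsearch settings: each failed election attempt widens the randomized window, which makes it less likely that two candidates time out together and clash again.

```python
# Illustrative Raft-style election timeout with back-off (hypothetical
# parameters): the upper bound of the randomized window grows with each
# failed attempt, so repeated clashes become increasingly unlikely.
import random

def election_timeout(attempt: int, base_ms: int = 100,
                     backoff_ms: int = 100, max_ms: int = 10_000) -> int:
    """Pick a random timeout whose upper bound grows with the attempt count."""
    upper = min(base_ms + attempt * backoff_ms, max_ms)
    return random.randint(base_ms, upper)

for attempt in range(3):
    print(f"attempt {attempt}: wait {election_timeout(attempt)} ms before standing")
```

With many master-eligible nodes, several back-off rounds may be needed before the windows spread out enough for one candidate to win, which is one way to see why large numbers of master-eligible nodes slow elections down.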

If this configuration becomes really bad when N > x, what is the X here?

This depends not only on the number of nodes but also on other factors, such as network latency. It is best to avoid a setup with more than a handful of master-eligible nodes (there is typically no downside to limiting their number), and to use dedicated master nodes to ensure maximum stability and resilience of the cluster.


Thank you very much, Yannick. It's amazing to hear the why and how from the author. I'm sure your reply will benefit many who want an in-depth understanding of why master election is designed and implemented this way.

I've read the ZenDiscovery code myself and plugged a ZooKeeper-like thing into Zen in 1.x, because it was less reliable back then. I watched a nice video by Elastic that showed how master election evolved from 1.x to 7.x, with 5.x being much more reliable and 7.x being much more correct and faster.

@ywelsch A follow-up question: if plugging in ZooKeeper is not advised in 5.x+, is it simply because it's unnecessary, or do the negatives outweigh the positives it brings? Why is it a terrible idea to modify master election and plug in ZooKeeper in 7.x, if that is the case?

Thank you for the kind words.

if plugging in ZooKeeper is not advised in 5.x+, is it simply because it's unnecessary, or do the negatives outweigh the positives it brings? Why is it a terrible idea to modify master election and plug in ZooKeeper in 7.x, if that is the case?

I have never run the ZooKeeper plugin, which, as far as I'm aware, only existed for the 1.x series, so I can't comment much on its functionality. In general, building this functionality directly into the system instead of relying on third-party software has a number of advantages: it does not require the knowledge to operate and maintain yet another system, leader election / fault detection and other parts can be custom-tailored to the software at hand, and debugging issues becomes easier when it's all in one system, with more context at hand.

The built-in ES cluster coordination layer (aka Zen Discovery) became much more stable in 2.x, and in 5.x+ it took quite exceptional circumstances to trigger issues at this layer (see "Repeated network partitions can cause cluster state updates to be lost" on the resiliency status page, which is now marked as fixed in ES 7.0.0). Issues we encountered in the field with 5.x+ could most often be attributed to user errors (e.g. failing to properly configure minimum_master_nodes), or to bugs in other components of the system that resulted in data loss.

When we set out to rebuild the cluster coordination layer in 7.0.0 (more details in our recent blog post here), our goals were therefore first and foremost to simplify the story around minimum_master_nodes and other parts of the system that were too lenient (see the "Safety first" section in the linked blog post), and at the same time to make sure that this new implementation would be developed according to the latest engineering and testing standards, for example using formal methods to validate our designs up front and extensive testing with deterministic simulations (see e.g. https://www.youtube.com/watch?v=4fFDFbi3toc as a good intro to this).

Having the coordination layer directly built into Elasticsearch gives us the most flexibility to adapt it directly to Elasticsearch's needs, whether it's the dynamic nature of scaling or possible future extensions (e.g. voting-only nodes).


To clarify, in my previous post I meant a ZooKeeper-like thing (e.g. writing the master to a shared datastore), not the ZooKeeper plugin. Thank you very much for your reply! :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.