One I can think of is that the elected master might be a loaded data node, which can become unresponsive
Yes, that is one of the main reasons. A cluster cannot function without a stable and responsive master, so (especially in a larger deployment) you would typically want to avoid putting that responsibility on a node that is also executing queries and handling ingest load.
If any cluster state update needs to be replicated across all (or minimum_master_nodes) master-eligible nodes before being sent to non-master-eligible nodes, does this configuration make cluster state updates much slower in practice for bigger clusters, for example N == 12, 36, or 50+? Again, is the ESv7 discovery module going to perform better than Zen?
Cluster state publishing in Zen and ESv7 works in the same way and is documented in more detail here: Publishing the cluster state | Elasticsearch Guide [7.0] | Elastic
In particular, the cluster state is sent to all the nodes at once (master-eligible nodes are merely prioritized; the master does not wait for their responses before sending the state to the other nodes). Only once a majority of the master-eligible nodes have accepted this state is it actually committed and applied on all nodes. This means that, from a publication perspective, the number of master-eligible nodes will not matter that much, except that the state is committed slightly faster.
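To make the commit rule concrete, here is a minimal Python sketch of "committed once a majority of master-eligible nodes have accepted". It is not Elasticsearch code: `quorum_size`, `commit_latency_ms` and the latency figures are hypothetical names and values chosen for illustration, and it assumes every master-eligible node counts toward the majority.

```python
def quorum_size(master_eligible_count: int) -> int:
    """Smallest number of master-eligible nodes that forms a majority."""
    return master_eligible_count // 2 + 1


def commit_latency_ms(ack_latencies_ms: list[int]) -> int:
    """Time until a majority of master-eligible nodes has accepted the state.

    ack_latencies_ms holds one (hypothetical) acknowledgement latency per
    master-eligible node. Because the state is sent to all nodes at once,
    the commit point is the moment the quorum-th fastest ack arrives.
    """
    if not ack_latencies_ms:
        raise ValueError("need at least one master-eligible node")
    needed = quorum_size(len(ack_latencies_ms))
    return sorted(ack_latencies_ms)[needed - 1]


if __name__ == "__main__":
    # Illustrative latencies only: one slow node in each group.
    print(commit_latency_ms([5, 10, 200]))
    print(commit_latency_ms([5, 10, 20, 30, 200]))
```

In this model a single slow master-eligible node does not hold up the commit, because only the majority has to respond.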
From a master election perspective, however, avoiding a large number of master-eligible nodes is advisable. Master elections work differently in Zen than in ESv7 discovery.
For Zen discovery, elections are based on a 3-second ping phase in which nodes learn which other nodes are around them and then actively vote for a node to become master, based on a deterministic function of the nodes that they found in the pinging phase. Note that this assumes that they all share the same knowledge of which nodes are out there, which is what the 3-second pinging phase is meant to establish. More details here: Zen Discovery | Elasticsearch Guide [6.7] | Elastic
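The reliance on a shared view can be sketched in a few lines of Python. This is an illustration, not the Zen implementation: the rule "lowest node ID wins" is a simplifying assumption standing in for the real deterministic function, and `elect_master`, `votes_agree` and the node names are hypothetical.

```python
def elect_master(visible_master_eligible: set[str]) -> str:
    """Pick a master deterministically from the nodes found while pinging.

    Assumption: the lowest node ID wins. The ordering is illustrative only.
    """
    if not visible_master_eligible:
        raise ValueError("no master-eligible nodes discovered")
    return min(visible_master_eligible)


def votes_agree(views: dict[str, set[str]]) -> bool:
    """True when every voter, given its own view, picks the same master."""
    choices = {elect_master(view) for view in views.values()}
    return len(choices) == 1


if __name__ == "__main__":
    everyone = {"node-a", "node-b", "node-c"}
    # All nodes discovered the same peers, so they vote for the same node.
    print(votes_agree({name: everyone for name in everyone}))
    # node-b did not discover node-a during pinging, so it votes differently.
    print(votes_agree({"node-a": everyone, "node-b": {"node-b", "node-c"}}))
```

When the views differ, the same deterministic function yields different votes, which is why the election depends on the ping phase giving every node the same picture.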
In ESv7, elections are Raft-style and based on randomized timeouts (Quorum-based decision making | Elasticsearch Guide [7.0] | Elastic). The advantage of this is that elections can typically be much quicker than the 3 seconds of Zen discovery. However, they bring the risk of election clashes in cases where many nodes participate and start concurrent elections. To deal with this, ESv7 automatically increases the randomized timeouts as elections fail, so that they will eventually succeed. In some experimental setups, this has been shown to work with 50+ nodes, but we do not recommend running a cluster that way, as it will lead to slower master elections.
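The effect of growing the randomized timeouts can be sketched as follows. This is a toy model, not the ESv7 implementation: the linear back-off, the constants, and the rule used in `has_clear_winner` (the earliest candidate needs a head start longer than one network round trip) are all assumptions made for illustration.

```python
import random

# Illustrative values in milliseconds; assumptions, not verified defaults.
INITIAL_TIMEOUT_MS = 100
BACK_OFF_MS = 100
MAX_TIMEOUT_MS = 10_000


def timeout_upper_bound_ms(failed_attempts: int) -> int:
    """Upper bound of the random wait, growing linearly with each failure."""
    return min(MAX_TIMEOUT_MS, INITIAL_TIMEOUT_MS + failed_attempts * BACK_OFF_MS)


def next_election_delay_ms(failed_attempts: int, rng: random.Random) -> float:
    """Random wait before this node starts its next election attempt."""
    return rng.uniform(0, timeout_upper_bound_ms(failed_attempts))


def has_clear_winner(delays_ms: list[float], round_trip_ms: float) -> bool:
    """Toy clash test: the first candidate must lead by more than a round trip."""
    ordered = sorted(delays_ms)
    return len(ordered) < 2 or ordered[1] - ordered[0] > round_trip_ms


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same numbers
    for attempt in (0, 5, 20):
        delays = [next_election_delay_ms(attempt, rng) for _ in range(50)]
        print(attempt, timeout_upper_bound_ms(attempt), has_clear_winner(delays, 20.0))
```

With many candidates drawing from a small window, two of them are likely to wake up within one round trip of each other; widening the window after each failure makes a clear winner more likely, at the cost of a longer wait.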
If this configuration becomes really bad when N > X, what is X here?
This depends not only on the number of nodes but also on a few other factors, such as network latency. It is best to avoid a setup with more than a handful of master-eligible nodes (there is typically no downside to limiting the number of master-eligible nodes), and to use dedicated master nodes to ensure maximum stability and resilience of the cluster.