Forcing an Elasticsearch cluster to run on a single node in the event of failure

Hi folks,

We are currently setting up a 3-node Elasticsearch cluster on v7.11.2 (self-managed) with the following setup:

  • All 3 nodes will be data nodes and master eligible.
  • We will set "index.number_of_replicas": 2 and use forced awareness to ensure each node has a copy of every shard.
  • A single instance of Kibana will run and point to all 3 nodes.

If one node fails, the cluster will still hold together. However, if 2 nodes fail, the cluster essentially stops working because it cannot establish a quorum.

I have tried using the voting_config_exclusions API, but it needs a minimum of 2 nodes to execute; on a single node it does not work. I think the discovery.zen.minimum_master_nodes setting is deprecated and also not advisable.

My question is: is it possible to run this setup on a single node as a contingency if the other 2 nodes cannot be recovered for any reason? Is the only way to achieve this to add discovery.type: "single-node" to the yml of node1 and restart it?

Thank you.

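If you have three nodes, there is no need for forced awareness, because Elasticsearch never allocates two copies of the same shard to a single node. With "index.number_of_replicas": 2, the primary and its two replicas therefore end up on three different nodes anyway.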
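To make the allocation point concrete, here is a toy Python model. It is not Elasticsearch code; the function name place_shard_copies and the node names are made up for illustration. The only rule it encodes is that a node may hold at most one copy of a given shard.

```python
def place_shard_copies(nodes, number_of_replicas):
    """Toy model: each node may hold at most one copy of a given shard.

    Returns (assigned, unassigned), where `assigned` maps node -> role and
    `unassigned` counts the copies that could not be placed on any node.
    """
    roles = ["primary"] + ["replica"] * number_of_replicas
    # zip stops at the shorter sequence, so surplus copies stay unassigned
    assigned = dict(zip(nodes, roles))
    unassigned = len(roles) - len(assigned)
    return assigned, unassigned


# Hypothetical node names for a 3-node cluster with 2 replicas per shard
assigned, unassigned = place_shard_copies(["node1", "node2", "node3"], 2)
print(assigned, unassigned)

# With only 2 nodes left, one replica has nowhere to go
assigned_2, unassigned_2 = place_shard_copies(["node1", "node2"], 2)
print(assigned_2, unassigned_2)
```

With three nodes every node ends up with exactly one copy, which is the outcome forced awareness was meant to guarantee.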

Setting discovery.type to single-node means that no cluster will ever be formed, because the node is meant to run entirely on its own.

If two nodes of a three-node cluster fail, you will have a failed cluster. There is no way to have a functioning cluster in that state, because the single remaining node can never form a quorum. Can you maybe explain what you are trying to achieve in case a quorum of your nodes is offline?

Hi @spinscale

I am listing various failure scenarios and how we can keep operations going with minimum downtime in each of them. One of them is the failure of 2 nodes, and I am trying to figure out how we can continue to run on a single node.

Thank you.

If you have a three-node cluster, it requires at least 2 nodes to be available to work properly. You might be able to manually get a single-node cluster to work by reconfiguring it, but that can be risky and is not guaranteed to work as far as I know.
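The "at least 2 of 3" figure is just majority arithmetic. Here is a minimal Python sketch, assuming a fixed voting configuration made up of all master-eligible nodes; the function names quorum and can_elect_master are made up for illustration and are not part of any Elasticsearch API.

```python
def quorum(master_eligible_nodes):
    """Smallest number of votes that is a strict majority."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(master_eligible_nodes, reachable_nodes):
    """True if the reachable nodes (including this one) form a majority."""
    return reachable_nodes >= quorum(master_eligible_nodes)


print(quorum(3))               # majority of 3 voting nodes
print(can_elect_master(3, 2))  # one node lost
print(can_elect_master(3, 1))  # two nodes lost
```

For three voting nodes the majority is 2, so losing one node is survivable and losing two is not.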

To understand why this is the case, you need to consider the various possible failure scenarios and how they would appear to the nodes in the cluster. If 2 nodes go down and the remaining node knew this for certain, it would be safe for it to continue operating. The problem is that the single node cannot tell the difference between the 2 other nodes going down and it simply losing network connectivity to them.

Since the other 2 nodes may still be connected to each other and operating, the single node cannot automatically continue working without risking a split-brain scenario and data loss.
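Here is a small Python sketch of that ambiguity. It is a toy model with made-up node names and a hypothetical helper, reachable_peers, not anything from Elasticsearch itself. It compares what node1 can observe when the other two nodes have crashed with what it observes when only its own network link is cut.

```python
def reachable_peers(node, alive, links):
    """Peers `node` can talk to: the peer must be alive and a link must exist."""
    return {p for p in alive if p != node and frozenset((node, p)) in links}


ALL_NODES = {"node1", "node2", "node3"}
FULL_MESH = {
    frozenset(("node1", "node2")),
    frozenset(("node1", "node3")),
    frozenset(("node2", "node3")),
}

# Scenario A: node2 and node3 have crashed; the network itself is fine.
view_a = reachable_peers("node1", alive={"node1"}, links=FULL_MESH)

# Scenario B: every node is alive, but node1 is cut off from the other two.
partitioned = {frozenset(("node2", "node3"))}
view_b = reachable_peers("node1", alive=ALL_NODES, links=partitioned)

# node1 observes exactly the same thing in both scenarios.
print(view_a == view_b)

# In scenario B the other side still holds 2 of 3 votes and keeps running.
majority_side = reachable_peers("node2", ALL_NODES, partitioned) | {"node2"}
print(sorted(majority_side))
```

In both scenarios node1 sees no peers at all, yet in scenario B node2 and node3 still form a majority. If node1 promoted itself to master there, two independent masters would be accepting writes.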