High Availability of my elastic cluster

Hello,

I have to maintain an elastic cluster composed of 8 data nodes on 4 virtual machines, 2 nodes per virtual machine.

We have one index created every day, composed of 8 shards and 1 replica per shard.

In this configuration, how many elastic nodes can I lose without any data loss?

What should the configuration be if I want to support the loss of 4 nodes without any data loss?

Thanks for your help
pme

One. If you lose more than one node then it is possible you would lose both a primary and replica.

You need each index to have 4 replicas, so that the cluster holds five copies of every shard. You also need at least 9 master-eligible nodes so that the 4 nodes you lost are fewer than half of the total.
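To make the arithmetic behind those two numbers concrete, here is a minimal Python sketch. The function names are made up for illustration, and it assumes every copy of a shard sits on a different node.

```python
def replicas_needed(node_losses: int) -> int:
    """Replicas per shard so that at least one copy survives `node_losses` lost nodes.

    Assumes every copy of a shard sits on a different node.
    """
    # node_losses failures can destroy at most node_losses copies, so we need
    # node_losses + 1 copies in total: one primary plus node_losses replicas.
    return node_losses


def master_eligible_needed(node_losses: int) -> int:
    """Master-eligible nodes so that a majority survives `node_losses` lost nodes."""
    # The survivors must be a strict majority: total - node_losses > total / 2.
    return 2 * node_losses + 1


print(replicas_needed(4), master_eligible_needed(4))  # 4 9
print(replicas_needed(1), master_eligible_needed(1))  # 1 3
```

The second line is the usual "plan for a single node going down" case: one replica and three master-eligible nodes.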

It might be cheaper to invest in more reliable infrastructure. It's normally sufficient to plan to deal with a single node going down, and if you're paranoid then you might try and deal with two. With reasonable infrastructure the probability of losing four nodes all at once should be infinitesimal compared with, say, a bug in your client software that accidentally deletes everything, against which no amount of redundancy can protect.

Hi David,

Thank you for your response!

Why would I need at least 9 master-eligible nodes so that the 4 nodes I lost are fewer than half of the total?

Thank you

Pierre

Sent: Sunday, 20 January 2019 13:37

Elasticsearch, like many other distributed systems, elects a master node using a majority-based voting system and so can only operate while a majority of the master-eligible nodes are available. If you want to tolerate the loss of 4 nodes then your majority must be at least 5 nodes, giving 9 nodes in total.
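As a rough illustration of the majority rule, the following Python sketch prints, for a few cluster sizes, the majority and the number of master-eligible nodes that can be lost while a majority remains. The helper names are hypothetical and this only models the counting, not the election itself.

```python
def majority(master_eligible: int) -> int:
    """Smallest number of nodes that is more than half of `master_eligible`."""
    return master_eligible // 2 + 1


def tolerated_losses(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - majority(master_eligible)


# Columns: master-eligible nodes, majority, tolerated losses.
for n in range(3, 10):
    print(n, majority(n), tolerated_losses(n))
```

With 8 master-eligible nodes the majority is 5, so only 3 losses are tolerated; 9 nodes are needed to tolerate 4. An even count tolerates no more losses than the odd count just below it.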

Hi David,

Thanks for your response.

I have discovery.zen.minimum_master_nodes = 5.

I have 8 nodes, 4 on each site. Imagine I lose one site.

The site that did not have the master can't form a cluster again (it only has 4 nodes), and the site that had the master keeps its elected master?
Am I right?

Thank you!

Pierre

Sent: Thursday, 7 February 2019 23:33

No, that's not right. minimum_master_nodes is the minimum number of master-eligible nodes that are needed for the cluster to operate. If you set it to 5 then you need 5 master-eligible nodes to be available at all times. If you only have 4 surviving nodes then that's not enough.

You didn't mention these "sites" earlier, so the answers I gave were about losing an arbitrary four nodes. But if your 8 nodes are split into two "sites" of four nodes then I suspect that really you are concerned with losing one or other site, not any random set of four nodes. This is much easier to deal with. Forced shard allocation awareness will split the shard copies evenly across the sites, so you can get away with a single replica of each shard and still be safe if one or other site is unavailable.
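For reference, here is a rough sketch of the settings involved, written as the Python dict you would send as the body of a cluster settings update. The attribute name `site` and the values `site_a` and `site_b` are placeholders invented for this example; each node would also need the matching `node.attr.site` value in its `elasticsearch.yml`. Check the allocation awareness documentation for your version before applying anything.

```python
import json

# Hypothetical attribute name and site names; substitute your own.
AWARENESS_ATTRIBUTE = "site"
SITES = ["site_a", "site_b"]

# Body for a cluster settings update (PUT _cluster/settings).
cluster_settings = {
    "persistent": {
        # Spread the copies of each shard across different values of the attribute.
        "cluster.routing.allocation.awareness.attributes": AWARENESS_ATTRIBUTE,
        # Forced awareness: list every expected value so that, while one site is
        # down, its replicas stay unassigned instead of being rebuilt on the
        # surviving site.
        f"cluster.routing.allocation.awareness.force.{AWARENESS_ATTRIBUTE}.values": ",".join(SITES),
    }
}

print(json.dumps(cluster_settings, indent=2))
```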

However, if your cluster is split across just two sites then there is no way to make it truly resilient to the loss of either site. I mean that is theoretically impossible, not that this is a limitation in Elasticsearch. The way people normally do this is to install a single master-eligible node in each site, and add a third "tiebreaker" site that just contains a single master-eligible node, giving three master-eligible nodes in total.
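A small Python sketch of why the tiebreaker helps, under the simplifying assumption that the cluster keeps working exactly when a majority of master-eligible nodes is still reachable. The site names and the function are hypothetical.

```python
def survives_site_loss(masters_per_site: dict[str, int]) -> dict[str, bool]:
    """For each site, report whether the remaining sites still hold a majority
    of the master-eligible nodes after that site is lost."""
    total = sum(masters_per_site.values())
    majority = total // 2 + 1
    return {
        site: total - count >= majority
        for site, count in masters_per_site.items()
    }


# Two sites, 4 + 4: losing either one leaves 4 of 8, which is not a majority.
print(survives_site_loss({"site_a": 4, "site_b": 4}))
# Two data sites plus a tiebreaker, one master-eligible node in each:
# losing any single site leaves 2 of 3, which is a majority.
print(survives_site_loss({"site_a": 1, "site_b": 1, "tiebreaker": 1}))
```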

David,

OK, I will look into forced shard allocation awareness.

You say: "However, if your cluster is split across just two sites then there is no way to make it truly resilient to the loss of either site."

But what if I accept the risk of split brain (I can't lose my intersite link) and set minimum_master_nodes to 4?

If I lose one site, can the other elect a master with just four nodes? Or is it mandatory to have an odd number of master-eligible nodes?

Today, in production, we have an even number of master-eligible nodes.

Do ingest nodes participate in the election of the master node?

Thank you for your response

Pierre

Sent: Wednesday, 13 February 2019 11:14

You will lose your intersite link at some point. Networks are not reliable. If you have minimum_master_nodes set to 4 then both sides will be able to elect masters and form independent clusters. If you manage to get them to join back up again then you will see data loss and maybe find some indices to be unrecoverably corrupt. I do not see how this is a risk you want to accept.

As I said, this is not a limitation within Elasticsearch, it's a theoretical impossibility. Fault tolerance requires at least three independent failure domains.
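To illustrate with the numbers from this thread, here is a simplified Python model (not how Elasticsearch is implemented) that tries every value of minimum_master_nodes for two sites of four master-eligible nodes. From inside one site, a broken link and a lost site look identical, so the two conditions below always come out the same.

```python
SITE_A, SITE_B = 4, 4  # master-eligible nodes per site, as in the question


def can_elect(visible_nodes: int, minimum_master_nodes: int) -> bool:
    """A group of nodes can elect a master only if it sees enough master-eligible nodes."""
    return visible_nodes >= minimum_master_nodes


rows = []
for mmn in range(1, SITE_A + SITE_B + 1):
    a_alone = can_elect(SITE_A, mmn)
    b_alone = can_elect(SITE_B, mmn)
    # If the link breaks and both sides can elect a master, the cluster splits in two.
    split_brain_possible = a_alone and b_alone
    # Losing a whole site is survivable only if the other site can elect on its own.
    survives_either_site_loss = a_alone and b_alone
    rows.append((mmn, split_brain_possible, survives_either_site_loss))

# Columns: minimum_master_nodes, split brain possible, survives loss of either site.
for row in rows:
    print(*row)
```

Values 1 to 4 survive the loss of a site but allow split brain; values 5 to 8 prevent split brain but cannot survive the loss of a site. No setting gives both.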
