Multi-cluster installation, cluster loss and recovery

Hi there,

As I'm new, I'll introduce myself quickly: I'm Christophe Dame and I work for Camunda. Our solution embeds Elasticsearch 7.17, and I'm currently experimenting with a dual-cluster setup.
Ideally, my goal is to have a single Elasticsearch spanning 2 clusters. My first naive attempt was to install 2 ES nodes in cluster 1 and 2 ES nodes in cluster 2 (using Helm charts). I added a nodeGroup to each of them, and I added seed_hosts to each, pointing to the headless services. Only the first cluster starts with initial_master_nodes set; I forced an empty value in the second cluster.
The result is good: I get a healthy 4-node cluster. I ran some operations that create indices, and the data is available in my web applications, which read locally in both clusters. Hooray!
Then I destroyed one cluster and recreated it... and the recreated nodes won't join. I get this error: ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

Do you have any advice? Is there something I'm doing wrong, or something I missed?

Thanks in advance!

That looks like a cluster formation problem, see these docs for information about how to troubleshoot it effectively.

If I had to guess, I think you probably destroyed too many master nodes at once. See these docs for information about setting up a resilient cluster, particularly the section on two-zone clusters:

You cannot configure a two-zone cluster so that it can tolerate the loss of either zone because this is theoretically impossible.

(terminology note, the whole Elasticsearch installation is called a "cluster" in these docs, and what you are calling a "cluster" is closer to a "zone")
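To make the "theoretically impossible" statement concrete, here is a minimal sketch of the majority arithmetic behind it. It is not Elasticsearch code: the function names are made up for illustration, and the only assumption is that electing a master needs a strict majority of the voting nodes.

```python
def has_quorum(surviving: int, total: int) -> bool:
    """A strict majority of the voting nodes must still be reachable."""
    return surviving > total // 2


def zone_loss_outcomes(zone_a: int, zone_b: int) -> dict:
    """For a given split of voting nodes across two zones, report whether
    the survivors still hold a majority after each zone is lost."""
    total = zone_a + zone_b
    return {
        "lose_zone_a": has_quorum(zone_b, total),
        "lose_zone_b": has_quorum(zone_a, total),
    }


def tolerates_either_zone_loss(zone_a: int, zone_b: int) -> bool:
    """True only if the cluster keeps a majority whichever zone is lost."""
    outcomes = zone_loss_outcomes(zone_a, zone_b)
    return outcomes["lose_zone_a"] and outcomes["lose_zone_b"]


if __name__ == "__main__":
    # Try every split of up to 7 voting nodes per zone.
    for a in range(0, 8):
        for b in range(0, 8):
            if tolerates_either_zone_loss(a, b):
                print(f"split {a}/{b} survives the loss of either zone")
    print("search finished")
```

Both zones would each need more than half of the voting nodes, which cannot happen, so the loop never finds a split that survives both failures. An uneven split such as 2/1 survives losing the smaller zone, but not the larger one.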

Hi David,

Many thanks for your quick response. You're right that the cluster cannot survive the loss of half of its nodes. But my assumption was that the data is replicated across the nodes, so I should not lose any data.
By recreating empty nodes (named like the lost ones), I should be able to redistribute the data and have my cluster healthy again, shouldn't I?

By recreating empty nodes (named like the lost ones), I should be able to redistribute the data and have my cluster healthy again, shouldn't I?

No, that's not safe. As the docs say, what you are trying to do is not even theoretically possible.
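As a rough illustration of why reusing the old names does not help, here is a toy model (not Elasticsearch code). It assumes that votes are tracked by a persistent node ID rather than by node name, and that a node started on an empty data path gets a fresh ID. The node names and the choice of which three nodes are in the voting configuration are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    # Assumption: the ID is generated on first start and kept on disk,
    # so a node recreated without its data gets a different one.
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def can_elect_master(voting_config: set, live_nodes: list) -> bool:
    """A master can be elected only if a strict majority of the IDs in
    the voting configuration belong to nodes that are currently alive."""
    live_ids = {n.node_id for n in live_nodes}
    votes = len(voting_config & live_ids)
    return votes > len(voting_config) // 2


# Hypothetical layout: two master-eligible nodes per zone.
zone1 = [Node("es-zone1-0"), Node("es-zone1-1")]
zone2 = [Node("es-zone2-0"), Node("es-zone2-1")]

# Hypothetical voting configuration: three of the four nodes,
# two of which happen to live in zone 1.
voting_config = {zone1[0].node_id, zone1[1].node_id, zone2[0].node_id}

# Zone 1 is destroyed and recreated empty, with the same node names.
recreated_zone1 = [Node(n.name) for n in zone1]

if __name__ == "__main__":
    print("all nodes up:", can_elect_master(voting_config, zone1 + zone2))
    print("zone 1 lost:", can_elect_master(voting_config, zone2))
    print("zone 1 recreated empty:",
          can_elect_master(voting_config, zone2 + recreated_zone1))
```

In this model the recreated nodes carry the old names but new IDs, so they add no votes and the surviving zone still holds only one of the three it needs.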

Thanks David, I think I get your point. I'll try to think about a workaround for that particular case. Thanks again :)