Recover a broken 3-node Elasticsearch cluster that has only 1 node left

Hi, I have an Elasticsearch cluster (7.6.1) that had 3 nodes. Two of the nodes were accidentally dropped while all 3 servers were running, so only 1 of the 3 servers survived. I want to save the cluster and its data. How do I add new nodes to it? The current cluster status is:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
} 

Let me guess: the surviving node was not master-eligible? Even if it was (and the others were too), you lost more than 50% of your master-eligible nodes, so the survivor can't win an election and the cluster can't recover.
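To make the majority rule concrete, here is a minimal sketch of the arithmetic. It is a simplified model, not Elasticsearch's actual election code, and the function name `has_quorum` and the node names are made up for illustration.

```python
def has_quorum(voting_config, reachable):
    """Return True if a strict majority of the voting nodes is reachable.

    Simplified model: both arguments are sets of (hypothetical) node names.
    """
    votes = len(set(voting_config) & set(reachable))
    return votes > len(voting_config) // 2


voting = {"node-1", "node-2", "node-3"}

# Two of three master-eligible nodes are up: a master can be elected.
print(has_quorum(voting, {"node-1", "node-2"}))  # True

# Only one of three is left: no majority, so no elected master.
print(has_quorum(voting, {"node-1"}))  # False
```

With three voting nodes the cluster tolerates the loss of one, but not two at the same time, which is the situation described above.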

Others may have more ideas, but I don't think there is a recovery path from here, as you can't get an elected master and without that, can't change things, find data, etc. No elected master = no cluster = no data :frowning:

It would be nice if there were more recovery methods, such as to tell an old master-eligible node to consider itself the single master with its stored state, and let a recovery proceed from there with any existing nodes, etc, but alas, no one has built that yet, as far as I know.

Elasticsearch does a lot to protect your data, but once it breaks, it really breaks, and it has little in the way of tools or methods to recover even part of the lost data (e.g. segment export, bottom-up metadata rebuild, etc.). Maybe some day (we used to build such tools for older database systems). And maybe enterprise-level support has special tools.

Hope you have snapshots ... and that others have ideas.

Yep, snapshots are the answer here, or else build a fresh cluster and index the data from its original source again.
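For anyone who does have a snapshot, a restore onto a fresh cluster is a single call to the restore API. The sketch below only builds the HTTP request with the standard library. The URL, the repository name `my_backup` and the snapshot name `snapshot_1` are assumptions for illustration, and the helper `build_restore_request` is hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: address of the new cluster


def build_restore_request(repo, snapshot, indices="*"):
    """Build (but do not send) a POST to the snapshot restore endpoint."""
    body = {
        "indices": indices,             # index name pattern to restore
        "include_global_state": False,  # skip cluster-wide settings and templates
    }
    return urllib.request.Request(
        url=f"{ES_URL}/_snapshot/{repo}/{snapshot}/_restore",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_restore_request("my_backup", "snapshot_1")
print(req.get_method(), req.full_url)
# To actually run it against a live cluster:
# urllib.request.urlopen(req)
```

The repository has to be registered on the new cluster first, and a restore fails for any index that already exists and is open there.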

The kinds of low-level tools you are describing are very unsafe and users often misunderstand what their weak guarantees mean for the integrity of the data they claim to recover. No tool can protect you from every kind of disaster or mistake, so you have to take snapshots anyway if you care about your data, but if you have snapshots then there's little value in lower-level "rescue" tooling.

You get better support for sure, but there's no magical secret tool to fix this kind of disaster if you pay enough money. I mean there's snapshots, they're magic, but also free to use :smile:

Generally agreed, but frankly, if I were running and paying for an enterprise product at scale, I'd want more recovery tools, even partial ones, for real disasters, at least for a loss of masters/voting where the data may not really be lost but is considered lost (like split brain).

Surely educate people, but also trust that senior people with larger systems and enterprises can judge risk and tooling at various levels of need, etc. Especially as systems get larger and even snapshot recovery times get really long. In many ways ES is more reliable but also more brittle (no support for two-zone cluster issues still irks me).

A lot of energy was poured into these things in the RDBMS world going back decades, 3rd party tools, internal details, disaster recovery systems, etc. for what's mostly a single use case (DBMS), but I feel Elasticsearch is more powerful & flexible, but overly insular in some ways.

Just my overall feel, as this becomes more important as it gets used for more things, for more data, and for more mission-critical systems.

Thank you all for the insight. It really helps. Much appreciated. We don't have snapshots. So we will try to rebuild the cluster. I admit that not being able to recover from the remaining node by making it a new master is disappointing.

Snapshots are super easy to set up and work very well. They are mostly incremental, so updates are quite quick, and they can be pushed to a shared filesystem, S3, etc. I suggest you get them going as soon as you can. Really the nicest backup system there is, in my opinion.
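As a rough sketch of what getting them going involves: register a repository once, then create snapshots into it. The code below only builds the two requests with the standard library. The URL, repository name, snapshot name and filesystem path are assumptions for illustration, and the helper functions are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: address of the cluster


def build_register_repo_request(repo, location):
    """Build a PUT that registers a shared-filesystem ("fs") repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return urllib.request.Request(
        url=f"{ES_URL}/_snapshot/{repo}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_create_snapshot_request(repo, snapshot):
    """Build a PUT that creates a snapshot and waits until it completes."""
    return urllib.request.Request(
        url=f"{ES_URL}/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
        method="PUT",
    )


# Hypothetical names and path, for illustration only.
register = build_register_repo_request("my_backup", "/mount/backups/my_backup")
create = build_create_snapshot_request("my_backup", "snapshot_1")
for req in (register, create):
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to send to a live cluster
```

For an `fs` repository the location has to be listed under `path.repo` in `elasticsearch.yml` on every node. On 7.x an S3 repository needs the `repository-s3` plugin installed.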
