How to recover after a bad rolling upgrade

I was following the rolling upgrade documentation, but the first node I tried to upgrade from 7.3 to 7.6 didn't upgrade: it still reports that it is on 7.3, and my cluster health now shows red.

Now I am seeing:

.kibana_task_manager 0 p UNASSIGNED CLUSTER_RECOVERED
.kibana_task_manager 0 r UNASSIGNED CLUSTER_RECOVERED
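For context, output in that shape comes from the standard cat shards API; a request like the following (the column selection here is just one reasonable choice) lists every shard along with why it is unassigned:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason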

Running:

GET /_cluster/allocation/explain

Returns:

{
  "index" : ".kibana_task_manager",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2020-03-20T06:29:05.386Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "GLPJmH7-SBKATglJ4iu1IA",
      "node_name" : "node_01",
      "transport_address" : "XXX.XXX.XXX.XXX:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8363737088",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "CM-Un5daQOWv_CzekU0bRA"
      }
    },
    {
      "node_id" : "ZAKTP504TnujdrIM___Ljg",
      "node_name" : "node_02",
      "transport_address" : "XXX.XXX.XXX.XXX:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8363724800",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}

I have read some solutions, but am a little scared to start trying things. Any help or advice would be greatly appreciated.

Have you completed the upgrade? One of your nodes has a stale copy of this shard and the other has no copy at all, which indicates that there's another node out there that has the good copy of this shard.
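One quick way to check that is to list the nodes currently in the cluster and the version each is running; if a node is missing from this output, it has not (re)joined. A sketch using the standard cat nodes API (column choice is illustrative):

GET _cat/nodes?v&h=name,version,master,node.role

If the node holding the good shard copy shows up here after it rejoins, the primary should be assigned from it automatically.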

Wait, CLUSTER_RECOVERED means that this cluster has experienced a full cluster restart, i.e. you restarted the master nodes. That doesn't happen in a rolling upgrade.

I stopped after seeing the RED status and started googling to find out why my node never recovered after restarting it. I did turn the cluster off last night and turned it back on this morning. I am just running it on a couple of VMs.

OK, you are in the situation described in the IMPORTANT note at the bottom of the rolling upgrade instructions:

If you stop half or more of the master-eligible nodes all at once during the upgrade then the cluster will become unavailable, meaning that the upgrade is no longer a rolling upgrade. If this happens, you should upgrade and restart all of the stopped master-eligible nodes to allow the cluster to form again, as if performing a full-cluster restart upgrade. It may also be necessary to upgrade all of the remaining old nodes before they can join the cluster after it re-forms.
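In practice that means the remaining steps look like the full-cluster restart procedure rather than a rolling one. A rough sketch for each stopped or old node, assuming a systemd-based install (service name, package manager, and exact 7.6.x version are assumptions that may differ on your VMs):

sudo systemctl stop elasticsearch
# upgrade the Elasticsearch package to the target 7.6.x version
# using your package manager (apt, yum, etc.)
sudo systemctl start elasticsearch

Once enough upgraded master-eligible nodes are back up, the cluster can form again, and the node that still holds the in-sync copy of .kibana_task_manager shard 0 should allow the primary to be allocated.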

