Recover runner that lost its data

Hi there
I'm currently performing some reliability testing on my ECE cluster to see how I can recover from various scenarios.
I'm currently stuck with the following situation:

  • An allocator node has lost all its data.
  • It's still registered in the cluster and I'm unable to delete it, as the UI prompts me to 'demote' the node first. When I try to do that, I get 'Coordinator candidate [] is not a coordinator instance.'
  • The node has only the allocator role assigned.
  • I've tried assigning more roles and then demoting the node, and also removing the roles and then demoting the node, but neither worked, as the changes can never be applied (since the node no longer exists).

If I try to replace the node by joining a new node into the cluster with the same id, I get the following error:

  • Running Bootstrap container
  • Monitoring bootstrap process
  • Loaded bootstrap settings for additional host {}
  • Core services started. {}
  • Starting local runner {}
  • Started local runner {}
  • Waiting for runner container node {}
  • Runner container node detected {}
  • Waiting for local runner to generate runner-specific secrets {}
  • Waiting for local runner to generate runner-specific secrets {}
  • Unhandled error. {}
    -- An error has occurred in bootstrap process. Please examine logs --
    org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /services/runners/ece-allocator-6.service.consul/containers
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:117)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at no.found.curator.FutureBackgroundCallback.processResult(FutureBackgroundCallback.scala:22)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:852)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:629)
    at org.apache.curator.framework.imps.GetDataBuilderImpl$3.processResult(GetDataBuilderImpl.java:272)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:590)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
    ~
    Errors have caused Elastic Cloud Enterprise installation to fail - Please check logs
    Node type - additional

This makes sense, as a node with the same id is still registered.
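
For anyone scripting around this kind of failure, here is a minimal sketch of pulling the ZooKeeper error code and path out of the bootstrap output above. The function and type names are made up for illustration and are not part of ECE. The sketch only assumes the log line keeps the `KeeperException$...: KeeperErrorCode = ... for <path>` shape shown in the trace.

```python
import re
from typing import NamedTuple, Optional


class KeeperError(NamedTuple):
    """Fields extracted from a ZooKeeper KeeperException log line."""
    exception: str  # e.g. "NoAuthException"
    code: str       # e.g. "NoAuth"
    path: str       # the znode path the operation failed on


# Matches lines shaped like the one in the bootstrap output above.
_KEEPER_RE = re.compile(
    r"org\.apache\.zookeeper\.KeeperException\$(?P<exc>\w+):\s*"
    r"KeeperErrorCode\s*=\s*(?P<code>\w+)\s+for\s+(?P<path>\S+)"
)


def find_keeper_error(log_text: str) -> Optional[KeeperError]:
    """Return the first KeeperException found in log_text, or None."""
    match = _KEEPER_RE.search(log_text)
    if match is None:
        return None
    return KeeperError(match.group("exc"), match.group("code"), match.group("path"))


def runner_id_from_path(path: str) -> Optional[str]:
    """Extract the runner id from a /services/runners/<id>/... path.

    The path layout is taken from the single log line above; other
    layouts are simply reported as None.
    """
    parts = path.strip("/").split("/")
    if len(parts) >= 3 and parts[0] == "services" and parts[1] == "runners":
        return parts[2]
    return None
```

Run against the trace above, this would report the `NoAuth` code and the runner id that the new host collided with, which is enough to confirm that the failure is tied to the already-registered id.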

How can I (force) delete a runner that is in such a state?

Hello,
Did you vacate the allocator before demotion? If not, go to the allocator's page and move all cluster instances off it.
After this you should be able to remove the host from the ECE cluster.
We don't support adding a runner with the same name as a removed one (there are some reasons I can't recall right now). The name of a removed runner is stored in the database for the lifetime of the ECE cluster. Therefore you have to use a unique name every time you add a runner to the cluster. That's why we don't recommend using the IP address of the host as the runner ID.

Hi Yuri

I did not actively remove all nodes of the runner, but there were none on it. If I navigate to the allocator page of this runner (by clicking on the allocator role in the runner's details view), I cannot perform any action. I just get an error "Fetching allocator failed", and the details report a 404.

What version of ECE do you use?

Currently version 1.1.0
Updates are blocked, as it tries to update the missing node as well.

Can you give some details?

  • You shut down the host where the runner ran, didn't you?
  • Allocator was empty on that host.

Now:

  • You still see the allocator on the page /region/ece-region/allocators
  • You still see the runner on the page /region/ece-region/runners
  • When you go to /region/ece-region/runners/{runnerId} you see an error.

The allocator host was shut down and the VM was deleted.
I'm not 100% sure if the allocator was empty. But as all clusters were fine, I assume it was empty, or the clusters on it had been deleted or migrated.

The allocator is not listed in the allocators overview (/region/ece-region/allocators).
The runner is listed in the runners overview (/region/ece-region/runners).
The runner detail view (/region/ece-region/runners/{runnerId}) reports the runner as not healthy.

Does the runner that you are trying to delete have the coordinator role assigned? If yes, try to remove it.

@Lafunamor

A few things:

  • Unfortunately (this is a known issue on our list to fix), you can never start a new runner with the same id. We recommend that the runner id look something like "--", e.g. "allocator-10.1.1.4-abfe0132".
  • (the allocator disappearing is correct and means that you had no clusters left, the system auto-removes it)
  • The demote button means that the role has been a director in the past, is that correct? It sounds like the system has gotten a bit confused :frowning:
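
As a rough illustration of the unique-id advice in the first point, here is a minimal sketch of generating an id shaped like the "allocator-10.1.1.4-abfe0132" example. It assumes that example breaks down as role, host IP, and a random hex suffix, which is only a reading of the example and not a documented format. The function name is hypothetical and not part of ECE.

```python
import secrets


def make_runner_id(role: str, ip: str, suffix_bytes: int = 4) -> str:
    """Build a runner id such as 'allocator-10.1.1.4-abfe0132'.

    Assumption: the id is role + host IP + random hex suffix, inferred
    from the example in this thread. The random suffix is what keeps a
    replacement host from reusing the id of a removed runner.
    """
    suffix = secrets.token_hex(suffix_bytes)  # 4 bytes -> 8 hex characters
    return f"{role}-{ip}-{suffix}"


if __name__ == "__main__":
    # Prints a fresh id on every run, e.g. allocator-10.1.1.4-<8 hex chars>
    print(make_runner_id("allocator", "10.1.1.4"))
```

Because the suffix is random, rebuilding a host on the same IP still yields a new id, which sidesteps the same-id restriction described above.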

We're working on improving the UX flow for removing ZooKeeper (director) from runners. I think the problem may have been that you moved the director off the allocator but then didn't demote it, and now the demote fails because the runner is dead, or something like that.

(Obviously it's a bug on our side, the current supported flow is something like: remove director, demote, then delete the runner ... probably remove director, delete runner leads to this unexpected state)

Apart from the annoyance of the extra runner, this bug should be "harmless". We can provide a script to remove that extra runner.

What are the roles currently listed under the defunct runner, out of interest?

@Yuri I don't know if it once had it. I tried assigning and removing the role, but the result is the same.

@Alex_Piggott
Thanks for the information. I was actually able to delete runners and then add new runners with the same ID.
The dead runner has the allocator role and the following containers: allocator, runner, beats-runner, client-forwarder, services-forwarder

As it's a test cluster anyway, I'll just delete the whole cluster and recreate it. Nevertheless, it would be nice to be able to force delete such a runner.