ECE Unhealthy platform, internal server error

My ECE platform had host-related problems: I lost the host that was installed first, which leaves two remaining hosts. The ECE UI shows "There was an internal server error".

The platform was initially installed with v2.4.1.

I created a token to add a new host to the platform and went through steps similar to the original install; however, there was an error relating to the version of the downloaded install/bootstrap script, which is v2.4.3.

Is it possible to download older versions of the ECE install / bootstrap script?

Or should I try to upgrade the fragments of the already compromised platform?

Any tips?

What roles do you have on the 2 remaining hosts?

There are 3 common "sick ECE following outage" cases:

1] The platform always needs a live ZK (ZooKeeper) quorum (e.g. 1/1, 2+/3, or 3+/5 "director" roles); nothing will work until this is satisfied (any situation other than 0 live director roles can be recovered, but it can require a bit of hand-editing of files)

2] The system cannot be controlled unless there is 1+ coordinator role ... this is what the "emergency token" fixes

3] If there are 0 instances of the system clusters running, some but not all of the UI breaks ... this can be fixed from the UI or API

Coming back to the original question ... if your system was running 2.4.1 then you should use the 2.4.1 install script ... this can easily be done by editing the current script file and setting the version number at the top (but it would be good to understand which of the 3 states your system is in first)

Alex

Thanks Alex,

Wow, my last natural choice would have been to fake the version on the current install script - although I did think of it!

The two hosts still running have the same set of roles, including proxy, allocator, coordinator, and director. Despite this, the UI shows an internal server error on most things.

I tried the upgrade after using the API to DELETE the allocator/runner that no longer existed; it failed and rolled back.

I'll try faking the script version now...

Actually you don't need to fake it ... there's a version option in the CLI: https://www.elastic.co/guide/en/cloud-enterprise/current/ece-installation-script.html (which is equivalent to what I suggested, but obviously less hacky-looking!)
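For reference, the shape of the command would be something like this (the exact flag name is in the linked installation-script docs, so double-check it against your download):

  # re-run the downloaded script but pin the platform version to match the existing installation
  bash elastic-cloud-enterprise.sh install --cloud-enterprise-version 2.4.1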

Had you moved the system clusters to being HA? It sounds like that's the problem (which is "just a UI" issue, and won't be fixed by installing the new version, though of course you will need to do that anyway at some point)

If this is the problem:

To fix the AC cluster being down, the steps are something like:

  • Restart the remaining 2 adminconsole services (docker restart frc-admin-consoles-admin-console)
  • Use the /clusters API to find the adminconsole-cluster cluster's id
  • Terminate (NOT DELETE) and restore that cluster via the API (/api/v1/clusters/elasticsearch/:id/_shutdown and _restore)

That should bring the UI back up (you can do this before or after re-installing the dead host)

Then check if the L+M (logging-and-metrics) cluster is also down and terminate/restore that too

Make sure they are both HA by making them 2+ zones
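As a rough sketch of those calls (USER:PASS and ID are placeholders, the port matches the admin API used elsewhere in this thread, and the restore step appears as _restart in the 2.x API reference, so adjust to whatever your version actually accepts):

  # 1) list the ES clusters and note the id of the admin console cluster
  curl -u USER:PASS "localhost:12400/api/v1/clusters/elasticsearch"
  # 2) terminate (do NOT delete) the admin console cluster
  curl -u USER:PASS -XPOST "localhost:12400/api/v1/clusters/elasticsearch/ID/_shutdown"
  # 3) bring it back up again
  curl -u USER:PASS -XPOST "localhost:12400/api/v1/clusters/elasticsearch/ID/_restart"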

Alex, thanks again...

I found the ID of the admin-console-elasticsearch but wasn't allowed to shut it down -
"message": "You cannot change the cluster configuration of system-owned clusters. Reason: Your change would shut down the cluster that the system requires to operate, which is not permitted."

Is there a trick to this?

Oh sorry, forgot about that!

This should enable you to shut it down:

curl -u USER:PASS -XPATCH -H "content-type: application/json" "localhost:12400/api/v1/clusters/elasticsearch/ID/metadata/settings" -d '{ "system_owned": false }'

Once you've restored it, you can run the same command to bring it back under the "system owned" umbrella
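i.e. once the cluster is healthy again, the same PATCH with the flag flipped back:

  curl -u USER:PASS -XPATCH -H "content-type: application/json" "localhost:12400/api/v1/clusters/elasticsearch/ID/metadata/settings" -d '{ "system_owned": true }'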

Before doing it - can you confirm that your system clusters weren't HA?

Alex,

I was fairly sure the system clusters were HA - I think I was following this: https://www.elastic.co/guide/en/cloud-enterprise/current/ece-topology-example1.html although I may have made mistakes along the way...
I was using 3 AZs.

Hmm, interesting ... if you do GET /api/v1/clusters/elasticsearch/ID it should tell you whether it's healthy, and if not, why?
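i.e. something along these lines (same placeholders as before):

  # returns the cluster's health status plus the reasons it is unhealthy, if any
  curl -u USER:PASS "localhost:12400/api/v1/clusters/elasticsearch/ID"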

The fact that the API works and the UI doesn't is normally a dead giveaway that it's the AC cluster that is down though

Alex,

That's interesting: the GET of the AC cluster still references the allocator that no longer exists, and says it's unhealthy - but it says the remaining allocators are healthy. Overall the Elasticsearch cluster is healthy. The UI says "Fetching region ece-region failed".

I'll go ahead and patch / restart the AC to see if it loses the reference to the missing allocator...

OK, in that case don't do the _shutdown; it will fail to come back with a capacity error until you have 3 healthy zones

(it referencing the dead allocator is 100% expected/normal ... it will show that instance as existing / being down until you resize down to 2 zones or _move the dead instance onto the new allocator)
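If you do go the _move route, the vacate call is roughly this shape (the allocator id is whatever the dead host shows up as under the allocators API; double-check the endpoint against the API reference for your version):

  # vacate the dead allocator so its instances are recreated on a healthy one
  curl -u USER:PASS -XPOST "localhost:12400/api/v1/platform/infrastructure/allocators/DEAD_ALLOCATOR_ID/clusters/_move"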

On a UI page returning "Fetching region ece-region failed", can you look in the browser network debugger and see which API call is actually failing? That should isolate the problem.

Is it possible you only had one proxy role set up? Actually, I see you already said you have 2 proxies left, so this is very strange.

Alex

Good tip - I tried it and saw that fetching the region timed out.

api/v0.1/regions/ece-region GET 504 connection timed out

And requests to other endpoints are working?

What does GET /api/v1/clusters/elasticsearch/{cluster_id}/proxy/ return? (doc link)

That should conclusively prove whether the AC cluster + proxy infra is problematic or not.
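Something like (note that endpoint needs the management header; ID is the AC cluster id again):

  # proxies the request through ECE's proxy layer to the cluster itself
  curl -u USER:PASS -H 'X-Management-Request: true' "localhost:12400/api/v1/clusters/elasticsearch/ID/proxy/"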

Alex

After adding -H 'X-Management-Request: true', I got:
{"errors":[{"code":"root.unexpected_error","message":"There was an internal server error.","sub_code":"IAE","uuid":"91d1589a5623ac9d9f03f1eedfca68ef"}]}

Interesting, so the API server can't hit the AC cluster!

Can you hit the cluster directly? E.g. GET $cluster_id.$host_root (or however you normally hit ES clusters) ... it should return a 401 or 403, vs timing out like it does when the API server hits it.
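e.g. something like this, assuming the default proxy HTTPS port of 9243 (placeholders for the cluster id and your ECE domain):

  # deliberately unauthenticated; a 401/403 means the proxy layer can reach the cluster
  curl -ki "https://CLUSTER_ID.YOUR_ECE_DOMAIN:9243/"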

OK, so first point. When you said
"don't do the _shutdown, it will fail to come back again with a capacity error until you have 3 healthy zones"
I had already done the _shutdown / attempted the _restart at that point. I did earlier manage to add the extra host after telling the install script to use the older version.
It looks like the _restart did not work though - the GET of the clusters/elasticsearch/:id shows it is still stopped.
Trying the _restart again...

Second point - if I'm trying to hit the admin console endpoint directly but the cluster is stopped, I shouldn't expect it to work?

OK so to summarize the situation:

  • You currently believe (the UI is not working so you aren't 100% sure?) that you have 3 ECE hosts representing 3 zones running (all with all the roles - or at least allocator/director/proxy/coordinator)
  • The AC cluster is shut down and restarting it failed (GET ..../{cluster_id} should, I believe, give you details about where the plan failed; you may need to set show_plan_logs=true - see the example after this list)
  • You have seen GET api/v0.1/regions/ece-region time out (which is breaking the UI), but other API responses all appear to work?
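For that plan-failure detail, the call would look something like:

  # full plan history including per-step logs for the failed plan
  curl -u USER:PASS "localhost:12400/api/v1/clusters/elasticsearch/ID?show_plan_logs=true"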

Alex,

Yes - on the three hosts, docker ps shows plenty running on the old two; on the new one there's a much smaller list of things: beats runner, services forwarder, runners runner, and client forwarder. I didn't see errors at the time of installing the new host - was the token too old, perhaps?
GET /api/v1/platform/infrastructure/runners is currently timing out. (empty reply from server, after a long time)

The AC cluster is shut down; the show_plan_logs output is verbose.
The last thing that succeeded was [add-shield-user];
after that it hit an error on validate-plan-prerequisites:
message": "Retry [10] in [60000 milliseconds]. 1. Could not ensure capacity allocation for [70bceb62fa7f480395ca2cb9d6573fa3] and [List(9f0164e1-729a-4719-b78d-20a7786ddcf8, 49993111-6909-438a-993e-f4ca9fb38b32, 419d5ad4-0fd6-4d1b-87aa-d747776bc0b3)] result was [{"type":"capacity","failed_instances":}]",
"message": "Unexpected error during step: [validate-plan-prerequisites]: [no.found.constructor.validation.ValidationException: 1. Could not ensure capacity allocation for [70bceb62fa7f480395ca2cb9d6573fa3] and [List(738a27aa-fb49-4437-b9da-f5de440029d4, 6f984a0b-89cb-41d4-a009-fa65985a98c3, 3f8a1127-b822-4a1d-8718-e95dbf008c65)] result was [{"type":"capacity","failed_instances":}]]",

Yes, GET for ece-region is timing out and everything else the UI tries in the meantime works.

Ah, OK - since the new runner isn't running the allocator role, it makes sense that the AC cluster is failing to come up.

The new ECE host only having the base roles could be due to the token or install settings being wrong, or could be due to some other problem (whatever the root cause of the failing regions call is) ... my guess is that it's token/install related though.

Can you share what install settings you used and where exactly you got the token from (there were 3 generated at install time I think - emergency, allocator, and proxy?)

Alex,

I generated the token this morning sometime, following the command here: https://www.elastic.co/guide/en/cloud-enterprise/current/ece-topology-example1.html

  1. curl -k -H 'Content-Type: application/json' -u admin:PASSWORD https://localhost:12443/api/v1/platform/configuration/security/enrollment-tokens -d '{ "persistent": false, "roles": ["director", "coordinator", "proxy", "allocator"] }'

I don't think I used it within an hour because of the confusion over the version number. When I did get it to work, I was following step 3, although I set the availability zone to MY_ZONE-1 to step into the gap left by the host that was terminated.
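The install command I ran on the new host was roughly this shape (placeholders for the coordinator host and token; flag names as per step 3 of those docs, plus the version pin discussed earlier):

  bash elastic-cloud-enterprise.sh install \
    --coordinator-host EXISTING_COORDINATOR_IP \
    --roles "director,coordinator,proxy,allocator" \
    --availability-zone MY_ZONE-1 \
    --enrollment-token TOKEN \
    --cloud-enterprise-version 2.4.1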

Thanks again for all the help with this.