My ECE platform had host-related problems: I lost the host that was the first one installed, leaving two remaining hosts. The ECE UI shows "There was an internal server error".
The platform was originally installed with v2.4.1.
I created a token to add a new host to the platform and went through steps similar to the original install; however, there was an error relating to the version of the downloaded install / bootstrap script, which is v2.4.3.
Is it possible to download older versions of the ECE install / bootstrap script?
Or should I try to upgrade the remaining pieces of the already compromised platform?
There are 3 common "sick ECE following outage" cases (a quick triage sketch follows the list):
1] The platform always needs a live ZooKeeper quorum (e.g. 1/1, 2+/3, or 3+/5 "director" roles); nothing will work until this is satisfied. (Any situation other than 0 live director roles can be recovered, but it can require a bit of hand-editing of files.)
2] The system cannot be controlled unless there is at least 1 coordinator role; this is what the "emergency token" fixes.
3] If there are 0 instances of the system clusters running, some but not all of the UI breaks; this can be fixed from the UI or API.
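A quick way to tell which state you're in, sketched below; the port, container naming, and credentials are assumptions about a typical ECE install and may differ on yours:

    # On each remaining host: is the ZooKeeper/director container still up?
    docker ps | grep -i zookeeper
    # Can the coordinator API be reached at all? (admin password from the original install output)
    curl -sk -u admin:$PASSWORD https://COORDINATOR_HOST:12443/api/v1/platform
    # Which runners/roles does the platform still see?
    curl -sk -u admin:$PASSWORD https://COORDINATOR_HOST:12443/api/v1/platform/infrastructure/runners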
Coming back to the original question: if your system was running 2.4.1 then you should use the 2.4.1 install script. This can easily be done by editing the current script file and setting the version number at the top (but it would be good to understand which of the 3 states your system is in first).
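As a sketch of that edit, assuming the downloaded script is elastic-cloud-enterprise.sh and that the version is held in a single variable near the top (the variable name below is an assumption; check your copy of the script first):

    # Look at the top of the script to find where the version is set
    head -n 20 elastic-cloud-enterprise.sh
    # Pin it back to the version the platform is actually running (assumed variable name)
    sed -i 's/^CLOUD_ENTERPRISE_VERSION=.*/CLOUD_ENTERPRISE_VERSION=2.4.1/' elastic-cloud-enterprise.sh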
Wow, faking the version on the current install script would have been my last natural choice, although I did think of it!
The two hosts still running have the same set of roles, including proxy, allocator, coordinator, and director. Despite this, the UI shows an internal server error on most things.
I tried the upgrade after using the API to DELETE the allocator/runner that no longer existed; it failed and rolled back.
Had you made the system clusters HA? It sounds like that's the problem (which is "just a UI" issue, and won't be fixed by installing the new version, though of course you will need to do that anyway at some point).
If this is the problem:
To fix the AC cluster being down, the steps are something like this (a curl sketch follows the list):
Restart the remaining 2 adminconsole services (docker restart frc-admin-consoles-admin-console)
Use the /clusters API to find the admin console cluster's id
Terminate (NOT DELETE) and restore that cluster via the API (/api/v1/clusters/elasticsearch/:id/_shutdown and _restore)
That should bring the UI back up (you can do this before or after re-installing the dead host)
Then check if the L+M (logging-and-metrics) cluster is also down and terminate/restore that as well
Make sure they are both HA by giving them 2+ zones
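A minimal curl sketch of those steps, assuming the ECE API on port 12443 and the _shutdown/_restart endpoints that come up later in this thread (host, credentials, and cluster id are placeholders):

    # List the Elasticsearch clusters and note the admin console cluster's id
    curl -sk -u admin:$PASSWORD https://COORDINATOR_HOST:12443/api/v1/clusters/elasticsearch
    # Terminate (NOT delete) the cluster
    curl -sk -X POST -u admin:$PASSWORD https://COORDINATOR_HOST:12443/api/v1/clusters/elasticsearch/$CLUSTER_ID/_shutdown
    # Bring it back up again
    curl -sk -X POST -u admin:$PASSWORD https://COORDINATOR_HOST:12443/api/v1/clusters/elasticsearch/$CLUSTER_ID/_restart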
I found the ID of the admin-console-elasticsearch cluster but wasn't allowed to shut it down:
"message": "You cannot change the cluster configuration of system-owned clusters. Reason: Your change would shut down the cluster that the system requires to operate, which is not permitted."
That's interesting: the GET of the AC cluster still references the allocator that no longer exists and says it's unhealthy, but it says the remaining allocators are healthy. Overall the Elasticsearch cluster is healthy. The UI says "Fetching region ece-region failed".
I'll go ahead and patch / restart the AC to see if it loses the reference to the missing allocator...
OK, in that case don't do the _shutdown; it will fail to come back again with a capacity error until you have 3 healthy zones.
(It referencing the dead allocator is 100% expected/normal; it will show that instance as existing/being down until you resize down to 2 zones or _move the dead instance onto the new allocator.)
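If you go the _move route, the allocator-level move API is the usual tool; a sketch, assuming the ECE "move clusters" endpoint (host, credentials, and the allocator id are placeholders):

    # Move the instances hosted on the dead allocator onto the remaining healthy allocators
    curl -sk -X POST -u admin:$PASSWORD \
      "https://COORDINATOR_HOST:12443/api/v1/platform/infrastructure/allocators/$DEAD_ALLOCATOR_ID/clusters/_move"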
On a UI page returning "Fetching region ece-region failed", can you look in the browser network debugger and see which API call is actually failing? That should isolate the problem.
Is it possible you only had one proxy role set up? I see you already said you had 2 proxies left; this is very strange.
After adding the -H 'X-Management-Request: true' header, I got:
{"errors":[{"code":"root.unexpected_error","message":"There was an internal server error.","sub_code":"IAE","uuid":"91d1589a5623ac9d9f03f1eedfca68ef"}]}
Interesting, so the API server can't hit the AC cluster!
Can you hit the cluster directly? E.g. GET $cluster_id.$host_root (or however you normally hit ES clusters). It should return a 401 or 403, vs timing out like it does when the API server hits it.
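Something along these lines, assuming the default ECE proxy HTTPS port 9243 and that $CLUSTER_ID.$HOST_ROOT resolves to your proxies (both are placeholders):

    # An unauthenticated request should come back quickly with a 401/403 if the cluster is reachable through the proxy
    curl -vk https://$CLUSTER_ID.$HOST_ROOT:9243/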
OK so first point. When you said
" don't do the _shutdown , it will fail to come back again with a capacity error until you have 3 healthy zones"
I had already done the _shutdown and attempted a _restart at that point. Earlier I did manage to add the extra host after telling the install script to use the older version.
It looks like the _restart did not work, though; the GET of clusters/elasticsearch/:id shows it is still stopped.
Trying the _restart again...
Second point: if I'm trying to hit the admin console endpoint directly but the cluster is stopped, I shouldn't expect it to work?
You currently believe (the UI is not working so you aren't 100% sure?) that you have 3 ECE hosts representing 3 zones running (all with all the roles - or at least allocator/director/proxy/coordinator)
The AC cluster is shut down and restarting it failed. A GET of ..../{cluster_id} should, I believe, give you details about where the plan failed; you may need to set show_plan_logs=true (a curl sketch follows these points).
You have seen GET api/v0.1/regions/ece-region time out (which is breaking the UI), but other API responses all appear to work?
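A sketch of that plan-log lookup, using the same API endpoint as above (host, credentials, and cluster id are placeholders):

    # Fetch the cluster, including the step-by-step logs of the failed plan
    curl -sk -u admin:$PASSWORD \
      "https://COORDINATOR_HOST:12443/api/v1/clusters/elasticsearch/$CLUSTER_ID?show_plan_logs=true"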
Yes. On the three hosts, docker ps shows stuff running on the old two; on the new one there's a much smaller list of things: beats runner, services forwarder, runners runner, and client forwarder. Although I didn't see errors at the time of installing the new host, was the token too old, perhaps?
GET /api/v1/platform/infrastructure/runners is currently timing out. (empty reply from server, after a long time)
The AC cluster is shut down; the show_plan_logs output is verbose.
The last thing that succeeded was [add-shield-user].
After that it hit an error on validate-plan-prerequisites:
message": "Retry [10] in [60000 milliseconds]. 1. Could not ensure capacity allocation for [70bceb62fa7f480395ca2cb9d6573fa3] and [List(9f0164e1-729a-4719-b78d-20a7786ddcf8, 49993111-6909-438a-993e-f4ca9fb38b32, 419d5ad4-0fd6-4d1b-87aa-d747776bc0b3)] result was [{"type":"capacity","failed_instances":}]",
"message": "Unexpected error during step: [validate-plan-prerequisites]: [no.found.constructor.validation.ValidationException: 1. Could not ensure capacity allocation for [70bceb62fa7f480395ca2cb9d6573fa3] and [List(738a27aa-fb49-4437-b9da-f5de440029d4, 6f984a0b-89cb-41d4-a009-fa65985a98c3, 3f8a1127-b822-4a1d-8718-e95dbf008c65)] result was [{"type":"capacity","failed_instances":}]]",
Yes, GET for ece-region is timing out and everything else the UI tries in the meantime works.
Ah, OK. Since the new runner isn't running the allocator role, it makes sense that the AC cluster is failing to come up.
The new ECE host only having the base roles could be due to the token or install settings being wrong, or could be due to some other problem (whatever the root cause of the regions call failing is). My guess is that it's token/install related, though.
Can you share what install settings you used and where exactly you got the token from? (There were 3 generated at install time, I think: emergency, allocator, and proxy?)
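If the original token had expired, a fresh role token can be generated against a working coordinator; a minimal sketch, assuming the ECE enrollment-tokens API (host, credentials, and the role list are placeholders):

    # Generate a short-lived token valid for enrolling a host with the listed roles
    curl -sk -X POST -u admin:$PASSWORD \
      https://COORDINATOR_HOST:12443/api/v1/platform/configuration/security/enrollment-tokens \
      -H 'Content-Type: application/json' \
      -d '{ "persistent": false, "roles": ["allocator"] }'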
I don't think I used it within an hour because of the confusion over the version number. When I did get it to work, I was following step 3, although I set the availability zone to MY_ZONE-1 to step into the gap left by the host that was terminated.
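For context, an add-host install of that shape would look roughly like the following; the host IP, token value, and role list are placeholders, and the exact flags I passed may have differed:

    # Install ECE on the new host, enrolling it against an existing coordinator in the zone left empty by the dead host
    bash elastic-cloud-enterprise.sh install \
      --coordinator-host COORDINATOR_IP \
      --roles-token 'TOKEN_FROM_THE_UI_OR_API' \
      --roles "allocator,proxy,coordinator,director" \
      --availability-zone MY_ZONE-1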