I'm testing ECE 2.0.1 on AWS. My deployment is a small one with three ECE nodes, all with full roles.
One ECE node was terminated.
ECE should remain highly available even when one node is down, but in reality I cannot get anything from the Cloud UI.
It just tells me "The requested cluster is currently unavailable", as shown below.
It should indeed be HA; that's a configuration that gets used/tested a lot.
As you implied, the problem is that the proxy is not routing to the adminconsole cluster for some reason. I think you answered this implicitly, but could you just confirm that you had expanded the adminconsole cluster to 3 zones? (With all 3 as data zones, or 2 data zones and a tie-breaker?)
When you can access the cluster directly (did I understand that right, you got the port from docker and hit that directly?) but not via the proxy, it normally means one of two things:
The cluster can't elect a single master, eg the quorum size is too high - can you GET _cluster/settings and GET _cat/nodes via the direct connection to see if that's the case?
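For instance, something like the following (a sketch - the port, credentials, and TLS flags are assumptions; substitute the container-mapped port you found via `docker ps`):

```shell
# Hit the ES container directly, bypassing the ECE proxy.
# 18443 is an illustrative placeholder for the mapped port.
curl -sk -u elastic "https://localhost:18443/_cluster/settings?pretty"

# List the nodes the cluster can currently see, and which one is master:
curl -sk -u elastic "https://localhost:18443/_cat/nodes?v"
```

If `_cat/nodes` shows all expected nodes and a single elected master, the cluster side is healthy and the problem is in front of it.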
So ... the third reason for the issues you describe is indeed that the proxy table is not being built correctly, as you suggested - I didn't mention it before because it's pretty rare, normally involving the persistence store (ZooKeeper) being in an unhealthy state (eg out of ZK quorum).
Based on the log message, the proxy doesn't believe the cluster has ES quorum, which likely means it has the wrong master (because it's reading from a stale ZK store). ZooKeeper being unhealthy in some way could also cause slowness in some API requests/UI pages (ie the ones that aren't cached).
Do the logs for frc-zookeepers-zookeeper show anything interesting?
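A quick way to look (a sketch, assuming the standard ECE container name and that `docker logs` captures the service output on your host):

```shell
# Pull recent ZooKeeper container output and filter for election/quorum chatter:
docker logs --tail 500 frc-zookeepers-zookeeper 2>&1 \
  | grep -iE 'quorum|election|looking|leading|following'
```

Lines showing a node stuck in LOOKING state, or repeated failures to open a channel to a peer, are the usual signs of a lost ensemble.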
Yeah, at a quick glance it looks like ZK12 couldn't connect to ZK11, so it lost quorum when ZK10 went down.
ZK has occasionally been flaky when people have used odd cluster sizes (like 4), but I've never seen it go down when 1 node out of 3 has been lost.
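To see why sizes like 4 are odd choices: ZooKeeper needs a strict majority of voters, i.e. floor(n/2) + 1, so a 4-node ensemble tolerates no more failures than a 3-node one:

```shell
# Majority needed and failures tolerated for small ensemble sizes:
for n in 3 4 5; do
  majority=$(( n / 2 + 1 ))
  tolerated=$(( n - majority ))
  echo "ensemble=$n majority=$majority tolerated_failures=$tolerated"
done
# ensemble=3 majority=2 tolerated_failures=1
# ensemble=4 majority=3 tolerated_failures=1
# ensemble=5 majority=3 tolerated_failures=2
```

Which is why a healthy 3-node ensemble should have survived losing one node here - two survivors are still a majority of three.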
When I run into this sort of thing, normally I hand-edit the config files (in increasing id order) to bring the ZK cluster down to 1 node, get it running in standalone mode, etc. - then add the second node (I have notes I can dig out tomorrow).
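Roughly, shrinking to standalone means leaving only the surviving node's own `server.N` entry in `zoo.cfg` (a sketch - the server ids, IPs, and file location inside the frc-zookeepers-zookeeper container are assumptions, not your actual values):

```
# zoo.cfg on the surviving node (id 12 in this example) --
# comment out the dead peers so ZK can start without waiting on them:
# server.10=10.0.0.10:2888:3888
# server.11=10.0.0.11:2888:3888
server.12=10.0.0.12:2888:3888
```

Then restart ZK on that node, confirm it serves reads/writes, and re-add peers one at a time in increasing id order.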
Other people have told me that restarting both frc-directors-director and frc-zookeepers-zookeeper (zk first I think) has worked for them - that's probably an easier place to start!
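In container terms that's just (assuming the standard ECE container names and that plain `docker restart` is acceptable on your hosts):

```shell
# ZK first, then the director, per the suggested ordering:
docker restart frc-zookeepers-zookeeper
docker restart frc-directors-director
```

Run this on each affected host, and give ZK a moment to rejoin the ensemble before restarting the director.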
The issue was fixed after restarting the director and zookeeper services.
But if this happens in a production environment, that wouldn't be a good solution, right?
So as I understand it, this problem is caused by ZooKeeper: when one ZooKeeper participant leaves suddenly, the remaining participants cannot re-form a healthy ensemble. This leaves stale data that no longer matches the actual cluster state.
So the proxy thinks the ES cluster has no quorum, because the proxy only checks the cluster state from the ZooKeeper store.
As I mentioned in a previous post, in my experience what happened to you is pretty rare - I typically know about / help fix any ZK outages across the ~100 or so ensembles we've owned or managed over the last 2 years ... I think there have been 5-6 similar issues, and maybe 1 of them wasn't attributable to users manually going out of quorum, or at least to some unusual ZK config (eg an even quorum size, an overloaded ZK, etc).
In your case, based on your description and the logs, I can't see any configuration issue that could explain what happened. So it seems to fall into the non-ideal "ZK flakiness" category, which empirically has been decent, and of course ZK is a fairly common platform in production environments (though of course you're at 1/1 failures, so have cause to doubt that!).
For my interest, could you also share the logs on the other good zone ("id 11" in the ZK logs)?