Experiencing an Unavailable Cluster Issue

Hi,
I'm testing ECE 2.0.1 on AWS. My deployment is a small one, with three ECE nodes that each have all roles.

One ECE node was terminated.
ECE should be highly available even with one node down, but in reality I cannot get anything from the Cloud UI.
It just tells me "The requested cluster is currently unavailable", like below.

I tried to find the reason behind the issue.
If I connect to the admin console ES cluster directly, everything is OK.
But if I connect through the proxy, it tells me "The requested cluster is currently unavailable".

This is weird, and I don't know why the proxy reports the cluster as unavailable while the cluster is in green status.
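
To make this concrete, a direct request against the adminconsole instance works, while the same request through the ECE proxy fails. Something like the following (placeholders rather than my real values; the direct port is whatever Docker publishes for the instance, and 9243 is the proxy's HTTPS port):

curl http://<ece-host>:<docker-published-port>/_cluster/health?pretty=true -u elastic
curl -k https://<adminconsole-cluster-id>.<proxy-host>:9243/_cluster/health?pretty=true -u elastic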

Please help!

It should indeed be HA, that's a configuration that gets used/tested a lot.

As you implied, the problem is that the proxy is not routing to the adminconsole cluster for some reason. I think you answered this implicitly, but could you just confirm that you had expanded the adminconsole cluster to 3 zones? (With all 3 as data zones, or 2 data zones and a tie-breaker?)
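
One way to double-check is to pull the adminconsole cluster's current plan from the ECE API on a coordinator host and look at the zone count - something like the call below (the port, endpoint, and placeholder cluster ID are from memory of the 2.x API, so adjust for your install):

curl -k -s -u admin 'https://<coordinator-host>:12443/api/v1/clusters/elasticsearch/<adminconsole-cluster-id>/plan' | grep -i zone_count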

When you can access the cluster directly (did I understand that right, you got the port from Docker and hit that directly?) but not via the proxy, it normally means one of two things: either the cluster is in maintenance mode, or the proxy doesn't believe the cluster has a healthy master/quorum.

Alex

Alex, big thanks for your reply.
Your understanding is right.

The admin console cluster seems OK, as shown below.
[root@ip-172-31-27-255 ~]# curl http://172.31.5.83:18777/_cluster/settings?pretty=true -uelastic -p
Enter host password for user 'elastic':
{
  "persistent" : {
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "awareness" : {
            "attributes" : "region,availability_zone,logical_availability_zone"
          },
          "enable" : "all",
          "exclude" : {
            "_name" : "no_instances_excluded"
          }
        }
      }
    },
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  }
}
[root@ip-172-31-27-255 ~]# curl http://172.31.5.83:18777/_cat/nodes?pretty=true -uelastic -p
Enter host password for user 'elastic':
172.31.5.83 44 78 0 1.04 0.29 0.13 mdi * instance-0000000001
172.31.13.85 9 57 0 0.00 0.01 0.05 mdi - instance-0000000002
[root@ip-172-31-27-255 ~]# curl http://172.31.5.83:18777/_cat/indices?pretty=true -uelastic -p
Enter host password for user 'elastic':
green open v1-allocators-ece-region KVj9SYbjT1-dKeZjQnpx9A 1 1 3 2 55.6kb 27.8kb
green open container-sets-ece-region ghjmB0ZJRLyIL2IqOPVAnw 1 1 14 2 470.6kb 235.4kb
green open runners-ece-region pKSpZmbEQm6035yi-88l2g 1 1 3 1 462.9kb 231.4kb
green open clusters-ece-region uCDK0drLSNCy6WJbuWvixg 1 1 3 1 1mb 498.3kb
green open v1-kibana-clusters-ece-region m7bPJn91RKeFuQ8jFnx9Jg 1 1 2 0 72.8kb 34.6kb
green open allocators-ece-region hCLTJHnjTNGAo80o939tgw 1 1 3 2 293.6kb 146.8kb
green open constructors-ece-region lzh9tMLVR4iXt2x1nO9UHg 1 1 2 1 163.2kb 83.3kb
green open v1-elasticsearch-clusters-ece-region FhYNud-1ScCAB0z7Wc4k9A 1 1 9 6 244kb 122kb
green open v1-runners-ece-region ujXU0X9fT3Sn_wHakFX6OQ 1 1 54 14 40kb 20kb
green open kibana-clusters-ece-region WO4e4MAeSWeTsDqQQca39g 1 1 2 7 334.3kb 167.1kb
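
For completeness, the overall health can be checked against the same host and port with the standard _cluster/health API:

curl http://172.31.5.83:18777/_cluster/health?pretty=true -uelastic -p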

The cluster info retrieved through the ECE API is below:
https://gist.github.com/rockybean/e39a75a825208e981836229b56d2da5c

As far as I can see, maintenance mode is off everywhere.

At the same time, the admin console API service has become very slow. Some API calls take me more than 1 minute.
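
To put a number on the slowness, I'm timing calls with curl's built-in timer. The endpoint below is just an example against the ECE API on a coordinator host - adjust the host, port, and credentials for your install:

curl -k -s -o /dev/null -w 'total: %{time_total}s\n' -u admin 'https://<coordinator-host>:12443/api/v1/platform'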

Any other tips? Thanks!

@Alex_Piggott I'm also wondering if there is a routing-table problem on the proxy side.
How does the ECE proxy build its routing table?

In addition, I also see some warnings in the proxy log.

So ... the third reason for the issues you describe is indeed that the proxy table is not being built correctly, as you suggested - I didn't mention it before because it's pretty rare, normally involving the persistence store (Zookeeper) being in an unhealthy state (eg out of ZK quorum)

Based on the log message, the proxy doesn't believe the cluster has ES quorum, which likely means that it has the wrong master (because it's reading from a stale ZK store). Zookeeper being unhealthy in some way could also cause slowness from some API requests/UI pages (ie the ones that aren't cached)

Do the logs for frc-zookeepers-zookeeper show anything interesting?
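
A quick way to look, using the container name above (the grep pattern is just a suggestion):

docker logs frc-zookeepers-zookeeper 2>&1 | grep -iE 'quorum|election|leader|warn|error' | tail -n 50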

Alex

Maybe this was caused by the sudden node-left event. I will try to catch some logs from the ZooKeeper log. Wait for me~

@Alex_Piggott
Please check the logs below.

Please see the logs after 2019-01-16 13:04:08,419, marked with a warning sign.

This is the time when the node left.

How can I quickly fix this issue?

Yeah, at a quick glance it looks like ZK12 couldn't connect to ZK11, so it lost quorum when ZK10 went down.
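
If you want to see what each surviving node currently thinks, ZooKeeper's standard four-letter-word commands (ruok/stat/mntr) work against the ZK client port - note that stock ZooKeeper uses 2181 but ECE remaps it, so check your install, and you may need to run this from inside the frc-zookeepers-zookeeper container if the port isn't reachable from the host:

echo stat | nc <ece-host> <zk-client-port>
echo mntr | nc <ece-host> <zk-client-port>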

ZK has occasionally been flaky when people have used unusual cluster sizes (like 4), but I've never seen it go down when 1 node out of 3 has been lost.

When I run into this sort of thing, I normally hand-edit the config files (in increasing ID order) to bring the ZK cluster down to 1 node, get it running in standalone mode, etc., and then add the second node back (I have notes I can dig out tomorrow).
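
Very roughly, and using generic ZooKeeper config rather than the exact ECE file locations (which I'd have to dig out of my notes): on the surviving node, trim zoo.cfg down so that only its own server entry remains, which lets ZK come up standalone, then restart it and add the other nodes back one at a time.

# zoo.cfg on the surviving node - generic ZooKeeper example, paths and ports differ on ECE
tickTime=2000
dataDir=/path/to/zookeeper/data   # the existing data dir (keep its myid file as-is)
clientPort=2181                   # stock default; ECE remaps the client port
# keep only the surviving node's entry (or comment out all server.N lines) so ZK starts standalone
server.11=<surviving-host>:2888:3888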

Other people have told me that restarting both frc-directors-director and frc-zookeepers-zookeeper (ZK first, I think) has worked for them - that's probably an easier place to start!
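
i.e. something like this on each affected host, ZK container first:

docker restart frc-zookeepers-zookeeper
docker restart frc-directors-director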

Thanks!
I will try restarting the director and ZK first.

And I'm also curious about your hand-edit procedure. Looking forward to your notes.

@Alex_Piggott
The issue is fixed after restarting the director and ZooKeeper services.

But if this happens in a production environment, that would not be a good solution, right?

So, as I understand it, this problem was caused by ZooKeeper. When one ZooKeeper participant leaves suddenly, the remaining participants cannot form a healthy ensemble again. This leaves stale data that no longer matches the actual cluster state.
So the proxy thinks the ES cluster has no quorum, because the proxy only checks the cluster state recorded in the ZooKeeper store.

Am I right?

Yes that appears to be essentially what happened.

As I mentioned in a previous post, in my experience what happened to you is pretty rare - I typically know about / help fix any ZK outages in the ~100 or so ensembles we have owned or managed over the last 2 years ... I think there have been 5-6 similar issues, and maybe 1 of them wasn't attributable to users manually going out of quorum, or at least to some unusual ZK config (eg an even quorum size, an overloaded ZK, etc).

In your case, based on your description and the logs, I can't see any configuration issue that would explain what happened. So it seems to fall into the non-ideal "ZK flakiness" category - which empirically has been pretty rare - and of course ZK is a fairly common platform in production environments (though of course you're at 1/1 failures, so you have cause to doubt that!)

Out of interest, could you also share the logs from the other good zone ("id 11" in the ZK logs)?

Alex


Thanks for your explanation.

The logs are gone because I reinstalled ECE for my team's testing requirements.

If I meet this kind of issue again, let's discuss it then.

Thanks
