Experience Unavailable Cluster Issue

rockybean · January 16, 2019, 4:48pm

Hi,
I'm testing ECE 2.0.1 on AWS. My deployment is a small one: three ECE nodes, each with all roles.

One ECE node was terminated.
ECE should stay highly available even when one node is down, but in reality I cannot get anything from the Cloud UI.
It just tells me "The requested cluster is currently unavailable".

I tried to find the reason behind the issue.
If I connect to the admin console ES cluster directly, everything works.
But if I connect through the proxy, it tells me "The requested cluster is currently unavailable".

This is weird: I don't know why the proxy reports the cluster as unavailable while the cluster is in green status.

Please help!

Alex_Piggott · January 16, 2019, 5:23pm

It should indeed be HA; that's a configuration that gets used/tested a lot.

As you implied, the problem is that the proxy is not routing to the adminconsole cluster for some reason. I think you answered this implicitly, but could you just confirm that you had expanded the adminconsole cluster to 3 zones? (With all 3 data zones, or 2 data zones and a tie-breaker?)

When you can access the cluster directly (did I understand that right, you got the port from docker and hit that directly?) but not via the proxy, it normally means one of two things:

1. The cluster can't elect a single master, e.g. the quorum size is too high. Can you run GET _cluster/settings and GET _cat/nodes via the direct connection to see if that's the case?
2. All nodes that are up have maintenance mode set (so the proxy refuses to route to them). I think running curl -u admin:$PASSWORD "http://localhost:12400/api/v1/clusters/$THE_ADMIN_CLUSTER_ID" should tell you whether that is the case (https://www.elastic.co/guide/en/cloud-enterprise/2.0/get-es-cluster.html -> https://www.elastic.co/guide/en/cloud-enterprise/2.0/ElasticsearchClusterInfo.html, which will also give some additional cluster and instance info)
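The first check can be sketched in a few lines of Python. This is only an illustration of the quorum arithmetic, not ECE or Elasticsearch code: the function names are made up for this example, and the role strings are assumed to be the node.role values printed by GET _cat/nodes (an "m" marks a master-eligible node).

```python
def master_eligible_count(node_roles):
    """Count reachable nodes whose role string contains 'm' (master-eligible)."""
    return sum(1 for roles in node_roles if "m" in roles)


def can_elect_master(node_roles, minimum_master_nodes):
    """Zen discovery needs at least minimum_master_nodes master-eligible nodes."""
    return master_eligible_count(node_roles) >= minimum_master_nodes


def majority(total_master_eligible):
    """Usual recommendation for minimum_master_nodes: a strict majority."""
    return total_master_eligible // 2 + 1


if __name__ == "__main__":
    # Hypothetical example: a 3-node cluster that has lost one node.
    reachable = ["mdi", "mdi"]
    print(majority(3))                     # quorum size for 3 master-eligible nodes
    print(can_elect_master(reachable, 2))  # two survivors still meet a quorum of 2
    print(can_elect_master(reachable, 3))  # a quorum of 3 would be too high
```

If the quorum setting is larger than the number of master-eligible nodes still reachable, no master can be elected, which is the "quorum size is too high" case.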

Alex

rockybean · January 16, 2019, 11:54pm

Alex, big thanks for your reply.
Your understanding is right.

The admin console cluster seems OK, as shown below.
[root@ip-172-31-27-255 ~]# curl http://172.31.5.83:18777/_cluster/settings?pretty=true -uelastic -p
Enter host password for user 'elastic':
{
  "persistent" : {
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "awareness" : {
            "attributes" : "region,availability_zone,logical_availability_zone"
          },
          "enable" : "all",
          "exclude" : {
            "_name" : "no_instances_excluded"
          }
        }
      }
    },
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  }
}
[root@ip-172-31-27-255 ~]# curl http://172.31.5.83:18777/_cat/nodes?pretty=true -uelastic -p
Enter host password for user 'elastic':
172.31.5.83 44 78 0 1.04 0.29 0.13 mdi * instance-0000000001
172.31.13.85 9 57 0 0.00 0.01 0.05 mdi - instance-0000000002
[root@ip-172-31-27-255 ~]# curl http://172.31.5.83:18777/_cat/indices?pretty=true -uelastic -p
Enter host password for user 'elastic':
green open v1-allocators-ece-region KVj9SYbjT1-dKeZjQnpx9A 1 1 3 2 55.6kb 27.8kb
green open container-sets-ece-region ghjmB0ZJRLyIL2IqOPVAnw 1 1 14 2 470.6kb 235.4kb
green open runners-ece-region pKSpZmbEQm6035yi-88l2g 1 1 3 1 462.9kb 231.4kb
green open clusters-ece-region uCDK0drLSNCy6WJbuWvixg 1 1 3 1 1mb 498.3kb
green open v1-kibana-clusters-ece-region m7bPJn91RKeFuQ8jFnx9Jg 1 1 2 0 72.8kb 34.6kb
green open allocators-ece-region hCLTJHnjTNGAo80o939tgw 1 1 3 2 293.6kb 146.8kb
green open constructors-ece-region lzh9tMLVR4iXt2x1nO9UHg 1 1 2 1 163.2kb 83.3kb
green open v1-elasticsearch-clusters-ece-region FhYNud-1ScCAB0z7Wc4k9A 1 1 9 6 244kb 122kb
green open v1-runners-ece-region ujXU0X9fT3Sn_wHakFX6OQ 1 1 54 14 40kb 20kb
green open kibana-clusters-ece-region WO4e4MAeSWeTsDqQQca39g 1 1 2 7 334.3kb 167.1kb

The cluster info from the ECE API is in this gist:
https://gist.github.com/rockybean/e39a75a825208e981836229b56d2da5c

As far as I can see, maintenance mode is off on all instances.

At the same time, the admin console API service has become very slow. Some API calls take more than a minute.

Any other tips? Thanks!

rockybean · January 17, 2019, 12:04am

@Alex_Piggott I'm also wondering whether there is a routing table problem in the proxy.
How does the ECE proxy build its routing table?

rockybean · January 17, 2019, 12:19am

In addition, I also see some warnings in the proxy log.

Alex_Piggott · January 17, 2019, 12:42am

So ... the third reason for the issues you describe is indeed that the proxy table is not being built correctly, as you suggested - I didn't mention it before because it's pretty rare, normally involving the persistence store (Zookeeper) being in an unhealthy state (eg out of ZK quorum)

Based on the log message, the proxy doesn't believe the cluster has ES quorum, which likely means that it has the wrong master (because it's reading from a stale ZK store). Zookeeper being unhealthy in some way could also cause slowness from some API requests/UI pages (ie the ones that aren't cached)

Do the logs for frc-zookeepers-zookeeper show anything interesting?

Alex

rockybean · January 17, 2019, 12:51am

Maybe this is caused by the node leaving suddenly. I will try to capture some logs from ZooKeeper. Wait for me~

rockybean · January 17, 2019, 12:52am

@Alex_Piggott
Please check logs below.

https://gist.github.com/rockybean/125c57fca591fff93ce51bbd3bbd88bf

ece-zk-log

2019-01-16 06:10:21,496 [myid:] - INFO  [main:QuorumPeerConfig@117] - Reading configuration from: /elastic_cloud_apps/zookeeper/bin/../conf/zoo.cfg
2019-01-16 06:10:21,504 [myid:] - INFO  [main:QuorumPeerConfig@327] - clientPortAddress is 0.0.0.0/0.0.0.0:2193
2019-01-16 06:10:21,504 [myid:] - INFO  [main:QuorumPeerConfig@331] - secureClientPort is not set
2019-01-16 06:10:21,511 [myid:] - WARN  [main:QuorumPeerConfig@590] - No server failure will be tolerated. You need at least 3 servers.
2019-01-16 06:10:21,514 [myid:12] - WARN  [main:QuorumPeerConfig@655] - Peer type from servers list (OBSERVER) doesn't match peerType (PARTICIPANT). Defaulting to servers list.
2019-01-16 06:10:21,516 [myid:12] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2019-01-16 06:10:21,516 [myid:12] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 1
2019-01-16 06:10:21,518 [myid:12] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2019-01-16 06:10:21,518 [myid:12] - INFO  [main:ManagedUtil@46] - Log4j found with jmx enabled.
2019-01-16 06:10:21,532 [myid:12] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.

rockybean · January 17, 2019, 12:54am

Please see the log entries after 2019-01-16 13:04:08,419 that are marked WARN.

That is the time when the node left.

rockybean · January 17, 2019, 12:58am

How could I quickly fix this issue?

Alex_Piggott · January 17, 2019, 1:23am

Yeah at a quick glance it looks like ZK12 couldn't connect to ZK11 so lost quorum when ZK10 went down.

ZK has occasionally been flaky when people have used unusual ensemble sizes (like 4), but I've never seen it go down when 1 node out of 3 has been lost.
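As a rough illustration of why a size like 4 is unusual, here is the majority arithmetic as a small Python sketch. It assumes the standard ZooKeeper rule that a strict majority of voting participants must be up (observers do not count); the function names are invented for this example.

```python
def zk_majority(voting_members):
    """Smallest number of voting participants that forms a strict majority."""
    return voting_members // 2 + 1


def zk_tolerated_failures(voting_members):
    """How many voting participants can be lost while a majority remains."""
    return voting_members - zk_majority(voting_members)


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(size, zk_majority(size), zk_tolerated_failures(size))
```

With 3 voters one failure is tolerated, and with 4 voters still only one, so an even-sized ensemble adds a machine without adding fault tolerance.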

When I run into this sort of thing, I normally hand-edit the config files (in increasing ID order) to bring the ZK cluster down to 1 node, get it running in standalone mode, etc., and then add the second node (I have notes I can dig out tomorrow).

Other people have told me that restarting both frc-directors-director and frc-zookeepers-zookeeper (zk first I think) has worked for them - that's probably an easier place to start!

rockybean · January 17, 2019, 1:35am

Thanks!
I will try restarting the director and ZK first.

I'm also curious about your hand-edit procedure. Looking forward to your notes.

@Alex_Piggott

rockybean · January 17, 2019, 2:01am

@Alex_Piggott
The issue was fixed after restarting the director and ZooKeeper services.

But if this happens in a production environment, this would not be a good solution, right?

So as I understand it, this problem was caused by ZooKeeper. When one ZooKeeper participant left suddenly, the remaining participants could not form a healthy ensemble again. This left stale data in the store that did not match the actual cluster state.
So the proxy thought that the ES cluster had no quorum, because the proxy only checks the cluster state in the ZooKeeper store.

Am I right?

Alex_Piggott · January 17, 2019, 2:40pm

Yes that appears to be essentially what happened.

As I mentioned in a previous post, in my experience what happened to you is pretty rare - I typically know about / help fix any ZK outages in ~100 or so ensembles we own or manage over the last 2 years ... I think there have been 5-6 similar issues and maybe 1 of them wasn't attributable to users manually going out of quorum, or at least some unusual ZK config (eg even quorum size, overloaded ZK, etc)

In your case, based on your description/the logs I can't see any possible configuration issues that can explain what happened. So it seems to fall into the non-ideal "ZK flakiness" category, which empirically has been decent, and of course ZK is a fairly common platform in production environments (though of course you're at 1/1 failures, so have cause to doubt that!)

For my interest, could you also share the logs on the other good zone ("id 11" in the ZK logs)?

Alex

rockybean · January 18, 2019, 12:43am

Thanks for your explanation.

The logs are gone because I reinstalled ECE for my team's test requirements.

If I run into this kind of issue again, let's discuss it then.

Thanks
