Kibana Instances "Unable to Connect to Server"

I have installed ECE on a few machines and set up 2 clusters, each with a separate Kibana instance. After about a week of normal operation, I attempted to access one of the Kibana dashboards and received an "internal server error" message, much like this issue. All the Kibana instances across my clusters had this problem.

https://discuss.elastic.co/t/500-an-internal-server-error/?source_topic_id=104118

I then attempted to restart the Kibana instance, which didn't work. I then deleted and recreated the Kibana instance, which changed the error message to "unable to connect to server".

I ran some of the commands requested for diagnosis in that issue and have put them here:

To be clear, the ES endpoints are still functional.

Thanks for any advice you can provide.

@IanGabes Thanks for the attached diagnostic ... (UPDATE: this isn't actually your issue, leaving this up as a useful reference for people who do run out of space) the problem is that you have run out of disk space on your admin console cluster, which the UI depends on, causing queries to that cluster to fail.

If you have space on the allocator, you can use the API to up the capacity:

  • GET /clusters/elasticsearch/ID/plan/current
  • save the JSON body in a text file
  • edit it to double the capacity
  • POST that to /clusters/elasticsearch/ID/plan (rough sketch after this list)
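A rough sketch of those steps using Python's requests library, in case it helps - the coordinator host, port, API prefix, credentials, and the exact capacity field are all assumptions you'll need to adjust for your installation:

import requests

# Placeholders / assumptions -- adjust for your installation:
# admin API host and port, credentials, and the cluster ID.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")
CLUSTER_ID = "YOUR_CLUSTER_ID"

# 1. GET the current plan
plan = requests.get(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/plan/current",
    auth=AUTH, verify=False,  # verify=False only if you use self-signed certs
).json()

# 2. Double the capacity. The exact field depends on your ECE version;
#    "memory_per_node" inside "cluster_topology" is an assumption here --
#    check the JSON you got back and edit the capacity field you actually see.
for topology in plan.get("cluster_topology", []):
    topology["memory_per_node"] = topology["memory_per_node"] * 2

# 3. POST the edited plan back to apply it
resp = requests.post(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/plan",
    json=plan, auth=AUTH, verify=False,
)
print(resp.status_code, resp.text)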

If you don't have capacity, there are a few options:

  • you can edit the "secret advanced" metadata to increase the disk:RAM ratio; there's another Discuss post in which I describe how to do this, I'm on my phone so I'll link it in a separate post
  • you can shut some other clusters down first :slight_smile: POST to /clusters/elasticsearch/ID/_shutdown (sketch below) ... note that this deletes all the data forever unless you have snapshots enabled
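The shutdown call is the same shape as the sketch above (same placeholder host and credentials; be very sure about the data-loss warning before running it):

import requests

# Same placeholders as above -- coordinator host/port, credentials, cluster ID.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")
CLUSTER_ID = "SOME_OTHER_CLUSTER_ID"

# WARNING: this deletes the cluster's data permanently unless snapshots are enabled.
resp = requests.post(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/_shutdown",
    auth=AUTH, verify=False,
)
print(resp.status_code, resp.text)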

The ram:disk ratio is described here: ECE RAM to Storage Ratio

To set that via the API:

  • GET /clusters/elasticsearch/ID/metadata/raw and save the JSON body
  • edit the JSON to add the FS multiplier change as per the link
  • POST that back to /clusters/elasticsearch/ID/metadata/raw (sketch after this list)
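Same pattern as the plan sketch above, again with placeholder host, credentials and field name - the real FS-multiplier key is the one described in the linked post:

import requests

# Placeholders again: coordinator host/port, credentials, and cluster ID.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")
CLUSTER_ID = "YOUR_CLUSTER_ID"

# 1. GET the raw metadata and keep the JSON body
meta = requests.get(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/metadata/raw",
    auth=AUTH, verify=False,
).json()

# 2. Edit the JSON to add the FS multiplier change. The exact key and where it
#    lives are described in the linked RAM-to-storage post and vary by version,
#    so the commented line below is a placeholder, not the real field name:
# meta["resources"]["fs_multiplier"] = 64

# 3. POST the edited document back
resp = requests.post(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/metadata/raw",
    json=meta, auth=AUTH, verify=False,
)
print(resp.status_code, resp.text)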

@IanGabes

Having reread your post, I'm not sure that it is the same problem as https://discuss.elastic.co/t/500-an-internal-server-error/ ... can you confirm whether your ECE UI is working? (A broken ECE UI is, I believe, the root problem in the linked post.)

Can you clarify: you are hitting the Kibana endpoint directly (ie over the ECE proxy), not via the ECE UI, and it returned first a 500 and then (after you deleted and recreated the Kibana via the ECE UI) failed to connect altogether?

Apologies for the confusion

Can you look at the proxy logs and see if you are getting errors? Eg /mnt/data/elastic/RUNNERIP/services/proxy/logs/proxy.log
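Something like this quick scan might surface anything interesting (just a sketch; the path is the one above, with RUNNERIP as a placeholder for the runner's IP):

# Quick-and-dirty scan of the proxy log for suspicious lines.
log_path = "/mnt/data/elastic/RUNNERIP/services/proxy/logs/proxy.log"

with open(log_path) as f:
    for line in f:
        if any(word in line.lower() for word in ("error", "timeout", "refused")):
            print(line.rstrip())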

It's very strange that Kibana isn't working but Elasticsearch is, though; the proxy treats them identically. When you restarted/stopped/recreated the Kibana, did you get any errors reported?

I would check that you have disk space available on all the hosts.

The issue definitely feels like something is running out of disk (either the host, or some of the Kibana containers, or both).
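A tiny sketch you could run on each host to check; the /mnt/data path is an assumption based on the default layout:

import shutil

# Report free space on the host root and the ECE data directory.
for path in ("/", "/mnt/data"):
    total, used, free = shutil.disk_usage(path)
    print(f"{path}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")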

Hey @Alex_Piggott thanks for your time.

  1. My ECE UI still works, without problems. (I guess my eyes lit up when I saw the 500 error in the title.)
  2. I am hitting the Kibana proxy endpoint directly. I was first given a 500; after a restart and then a delete, it returned a "could not find cluster" error. I have been logging to ES continuously, both before and while I have been working on these problems.
  3. I received no errors during the recreation of the Kibana instance.
  4. I have 2TB of disk space left across my allocators.

My proxy logs look like this, which looks okay to my untrained eye:

I have gone ahead and reshuffled some resources around to give the admin cluster room to breathe.

Also, I noticed some errors on the proxy page of the ECE UI, but this just looks like some :

Here is an excerpt from the JSON response (when I query the below endpoint manually):

/api/v0.1/regions/ece-region

...
	"proxies": {
		"healthy": false,
		"expected_proxies_count": 1,
		"proxies_count": 0,
		"proxies": {}
	},
...
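(For reference, this is roughly how I'm querying it - host, port, and credentials below are placeholders for my setup.)

import requests

# Poll the regions endpoint and report proxy health.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")

region = requests.get(f"{BASE}/api/v0.1/regions/ece-region",
                      auth=AUTH, verify=False).json()
proxies = region["proxies"]
print("healthy:", proxies["healthy"],
      "| expected:", proxies["expected_proxies_count"],
      "| running:", proxies["proxies_count"])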

Now that I think we've found the problem, I am not sure how to fix it :( This doesn't make much sense to me, because the ES endpoints continue to function:

{
  "name" : "instance-0000000002",
  "cluster_name" : "88b9105111cc4bffb6bf7077510c3780",
  "cluster_uuid" : "8eZNRjdYSQCqRPhNOjHweQ",
  "version" : {
    "number" : "5.6.1",
    "build_hash" : "667b497",
    "build_date" : "2017-09-14T19:22:05.189Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

Hi @IanGabes

Sorry you're having problems. Thanks again for the detailed information you shared, which has really helped our diagnosis.

I grabbed a few people to discuss this case last night. It sounds a bit like an issue we've just started tracking, where the applications lose their connection to Zookeeper (the database of record).

Can you quickly grab the logs for the following:

  • /mnt/data/elastic/RUNNERIP/services/zookeeper/logs
  • /mnt/data/elastic/RUNNERIP/services/client-forwarder/logs

(These might not be exactly right, apologies; I'm not in the office) and look for any interesting timeouts or error messages?

Can you share a bit more about your architecture, eg what OS and kernel versions are you running, what is the underlying hardware, how many zookeeper processes (director roles in the UI) are you running, and is that on the allocators or on standalone machines?

In terms of workarounds, it should be possible to fix this by restarting the machine running the proxy (I believe the underlying issue is in the client-forwarder service, but restarting that may also require restarting other services).

@Alex_Piggott,

I am running allocators on three virtual hosts, each with 128GB of RAM, on the same network. We are running Ubuntu 16.04, kernel version 4.4.0-96, and Docker version 1.12. I have 4TB of disk space shared between the hosts.

The client-forwarder logs, with a cryptic "unexpected error":

The zookeeper logs look normal to me:

Thanks for the info @IanGabes, did rebooting the affected instance (ie the host running the proxy) work?

@IanGabes

We have been looking into this more (thanks again for the logs) - can you share any information about your disks (and also maybe the IO that things like iostat and docker stats are reporting)?

Our current working theory is something like:

  • The system IO is overloaded (see the zookeeper logs for warnings like "sync took 2500ms") - this might be because of the co-located elasticsearch nodes, or because the disk is underpowered
  • as a result zookeeper on overloaded nodes isn't able to respond to clients on the same node, like the proxy, within the 10s timeout
  • other services (like the constructor) are connecting to different zookeeper nodes and hence are able to continue functioning

Our recommended architecture is to run zookeeper in production on separate nodes from the allocators, for exactly this sort of reason. While evaluating, it is OK to run them on the same nodes, provided there is sufficient IO.

Alex

Alex, sorry for the late reply. We had set up the ECE software as a trial run for an event we were running, as a bit of a testing ground to see whether ECE would be a good fit to replace our current production cluster environments. Unfortunately, due to the timing of the problems I was experiencing, I had to decommission the allocators and re-provision for a managed ES cluster deployment. I won't be able to properly get you the diagnostic information to help us out :(

The disks I was running the allocators on were our "slow" storage: 7200rpm disks in a RAID-6-like configuration.

Thanks again for your help, I hope to get more bandwidth soon to test ECE out again, sorry for the disappointing troubleshooting session.

@IanGabes

Not at all - thanks for all the help you did provide, and sorry you ran into issues.

I think we ended up deciding internally that your issue was probably due to co-locating the director role (ie Zookeeper) and other bandwidth-intensive services (ie Elasticsearch) - we're updating our recommended architecture to reflect this ... we were able to reproduce similar issues on (slower) spinning disks (or IOPS-limited SSDs).

Please do reach out here or via any Elastic rep/contacts you might have next time you have a chance to play with ECE!

Alex
