Kibana Instances "Unable to Connect to Server"

I have installed ECE on a few machines and set up 2 clusters, each with a separate Kibana instance. After about a week of normal operation, I attempted to access one of the Kibana dashboards and received an "internal server error" message, much like this issue. All the Kibana instances across my clusters had this problem.

https://discuss.elastic.co/t/500-an-internal-server-error/?source_topic_id=104118

I then attempted to restart the Kibana instance, which didn't work. I then deleted and recreated the Kibana instance, which changed the error message to "unable to connect to server".

I ran some of the commands requested for diagnosis in that issue and have put them here:

To be clear, the ES endpoints are still functional.

Thanks for any advice you can provide.

@IanGabes Thanks for the attached diagnostic ... (UPDATE: this isn't actually your issue, leaving this up as a useful reference for people who do run out of space) the problem is that you have run out of disk space on your admin console cluster, which the UI depends on, causing queries to that cluster to fail.

If you have space on the allocator, you can use the API to up the capacity:

  • GET /clusters/elasticsearch/ID/plan/current
  • save the JSON body in a text file
  • edit it to double the capacity
  • POST that to /clusters/elasticsearch/ID/plan (rough sketch after this list)
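A rough sketch of those steps using Python's requests library, in case it helps - the coordinator host, port, API prefix, credentials, and the exact capacity field are all assumptions you'll need to adjust for your installation:

import requests

# Placeholders / assumptions -- adjust for your installation:
# admin API host and port, credentials, and the cluster ID.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")
CLUSTER_ID = "YOUR_CLUSTER_ID"

# 1. GET the current plan
plan = requests.get(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/plan/current",
    auth=AUTH, verify=False,  # verify=False only if you use self-signed certs
).json()

# 2. Double the capacity. The exact field depends on your ECE version;
#    "memory_per_node" inside "cluster_topology" is an assumption here --
#    check the JSON you got back and edit the capacity field you actually see.
for topology in plan.get("cluster_topology", []):
    topology["memory_per_node"] = topology["memory_per_node"] * 2

# 3. POST the edited plan back to apply it
resp = requests.post(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/plan",
    json=plan, auth=AUTH, verify=False,
)
print(resp.status_code, resp.text)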

If you don't have capacity, there are a few options:

  • you can edit the "secret advanced" metadata to increase the disk:RAM ratio; there's another Discuss post in which I describe how to do this, I'm on my phone so I'll link it in a separate post
  • you can shut some other clusters down first :slight_smile: POST to /clusters/elasticsearch/ID/_shutdown (sketch below) ... note that this deletes all the data forever unless you have snapshots enabled
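The shutdown call is the same shape as the sketch above (same placeholder host and credentials; be very sure about the data-loss warning before running it):

import requests

# Same placeholders as above -- coordinator host/port, credentials, cluster ID.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")
CLUSTER_ID = "SOME_OTHER_CLUSTER_ID"

# WARNING: this deletes the cluster's data permanently unless snapshots are enabled.
resp = requests.post(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/_shutdown",
    auth=AUTH, verify=False,
)
print(resp.status_code, resp.text)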

The ram:disk ratio is described here: ECE RAM to Storage Ratio

To set that via the API:

  • GET /clusters/elasticsearch/ID/metadata/raw and save the JSON body
  • edit the JSON to add the FS multiplier change as per the link
  • POST that back to /clusters/elasticsearch/ID/metadata/raw (sketch after this list)
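Same pattern as the plan sketch above, again with placeholder host, credentials and field name - the real FS-multiplier key is the one described in the linked post:

import requests

# Placeholders again: coordinator host/port, credentials, and cluster ID.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")
CLUSTER_ID = "YOUR_CLUSTER_ID"

# 1. GET the raw metadata and keep the JSON body
meta = requests.get(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/metadata/raw",
    auth=AUTH, verify=False,
).json()

# 2. Edit the JSON to add the FS multiplier change. The exact key and where it
#    lives are described in the linked RAM-to-storage post and vary by version,
#    so the commented line below is a placeholder, not the real field name:
# meta["resources"]["fs_multiplier"] = 64

# 3. POST the edited document back
resp = requests.post(
    f"{BASE}/api/v1/clusters/elasticsearch/{CLUSTER_ID}/metadata/raw",
    json=meta, auth=AUTH, verify=False,
)
print(resp.status_code, resp.text)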

@IanGabes

Having reread your post, I'm not sure that it is the same problem as https://discuss.elastic.co/t/500-an-internal-server-error/ ... can you confirm whether your ECE UI is working? (A broken ECE UI is, I believe, the root problem in the linked post.)

Can you clarify: you are hitting the Kibana endpoint directly (ie over the ECE proxy), not via the ECE UI, and it returned first a 500 and then (after you deleted and recreated the Kibana via the ECE UI) failed to connect altogether?

Apologies for the confusion

Can you look at the proxy logs and see if you are getting errors? Eg /mnt/data/elastic/RUNNERIP/services/proxy/logs/proxy.log
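Something like this quick scan might surface anything interesting (just a sketch; the path is the one above, with RUNNERIP as a placeholder for the runner's IP):

# Quick-and-dirty scan of the proxy log for suspicious lines.
log_path = "/mnt/data/elastic/RUNNERIP/services/proxy/logs/proxy.log"

with open(log_path) as f:
    for line in f:
        if any(word in line.lower() for word in ("error", "timeout", "refused")):
            print(line.rstrip())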

It's very strange that Kibana isn't working but Elasticsearch is, though; the proxy treats them identically. When you restarted/stopped/recreated the Kibana, did you get any errors reported?

I would check that you have disk space available on all the hosts.

The issue definitely feels like something is running out of disk (either the host, or some of the Kibana containers, or both).
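A tiny sketch you could run on each host to check; the /mnt/data path is an assumption based on the default layout:

import shutil

# Report free space on the host root and the ECE data directory.
for path in ("/", "/mnt/data"):
    total, used, free = shutil.disk_usage(path)
    print(f"{path}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")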

Hey @Alex_Piggott thanks for your time.

  1. My ECE UI still works, without problems. (I guess my eyes lit up when I saw the 500 error in the title.)
  2. I am hitting the Kibana proxy endpoint directly. I was first given a 500; after a restart and then a delete, it returned a "could not find cluster" error. I have been logging to ES continuously, both before and while I have been working on these problems.
  3. I received no errors during the recreation of the Kibana instance.
  4. I have 2TB of disk space left across my allocators.

My proxy logs look like this, which looks okay to my untrained eye:

I have gone ahead and reshuffled some resources around to give the admin cluster room to breathe.

Also, I noticed some errors on the proxy page of the ECE UI, but this just looks like some :

Here is an excerpt from the JSON response (when I query the below endpoint manually):

/api/v0.1/regions/ece-region

...
	"proxies": {
		"healthy": false,
		"expected_proxies_count": 1,
		"proxies_count": 0,
		"proxies": {}
	},
...
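(For reference, this is roughly how I'm querying it - host, port, and credentials below are placeholders for my setup.)

import requests

# Poll the regions endpoint and report proxy health.
BASE = "https://COORDINATOR_HOST:12443"
AUTH = ("admin", "PASSWORD")

region = requests.get(f"{BASE}/api/v0.1/regions/ece-region",
                      auth=AUTH, verify=False).json()
proxies = region["proxies"]
print("healthy:", proxies["healthy"],
      "| expected:", proxies["expected_proxies_count"],
      "| running:", proxies["proxies_count"])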

Now that I think we've found the problem, I am not sure how to fix it :( This doesn't make much sense to me, because the ES endpoints continue to function:

{
  "name" : "instance-0000000002",
  "cluster_name" : "88b9105111cc4bffb6bf7077510c3780",
  "cluster_uuid" : "8eZNRjdYSQCqRPhNOjHweQ",
  "version" : {
    "number" : "5.6.1",
    "build_hash" : "667b497",
    "build_date" : "2017-09-14T19:22:05.189Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

Hi @IanGabes

Sorry you're having problems. Thanks again for the detailed information you shared, which has really helped our diagnosis.

I grabbed a few people to discuss this case last night. It sounds a bit like an issue we've just started tracking, where the applications lose their connection to Zookeeper (the database of record).

Can you quickly grab the logs for the following:

  • /mnt/data/elastic/RUNNERIP/services/zookeeper/logs
  • /mnt/data/elastic/RUNNERIP/services/client-forwarder/logs

(These might not be exactly right, apologies; I'm not in the office) and look for any interesting timeouts or error messages?

Can you share a bit more about your architecture, eg what OS and kernel versions are you running, what is the underlying hardware, how many zookeeper processes (director roles in the UI) are you running, and is that on the allocators or on standalone machines?

In terms of workarounds, it should be possible to fix this by restarting the machine running the proxy (I believe the underlying issue is in the client-forwarder service, but restarting that may also require restarting other services).

@Alex_Piggott,

I am running allocators on three virtual hosts, each with 128GB of RAM, on the same network. We are running Ubuntu 16.04, kernel version 4.4.0-96, and Docker version 1.12. I have 4TB of disk space shared between the hosts.

The client-forwarder logs, with a cryptic "unexpected error":

The zookeeper logs look normal to me:

Thanks for the info @IanGabes, did rebooting the affected instance (ie the host running the proxy) work?

@IanGabes

We have been looking into this more (thanks again for the logs) - can you share any information about your disks (and also maybe the IO that things like iostat and docker stats are reporting)?

Our current working theory is something like:

  • The system IO is overloaded (see the zookeeper logs for warnings like "sync took 2500ms") - this might be because of the co-located elasticsearch nodes, or because the disk is underpowered
  • as a result zookeeper on overloaded nodes isn't able to respond to clients on the same node, like the proxy, within the 10s timeout
  • other services (like the constructor) are connecting to different zookeeper nodes and hence are able to continue functioning

Our recommended architecture is to run zookeeper in production on separate nodes from the allocators, for exactly this sort of reason. While evaluating, it is OK to run them on the same nodes, provided there is sufficient IO.

Alex

Alex, sorry for the late reply. We had set up the ECE software as a trial run for an event we were running, as a bit of a testing ground to see whether ECE would be a good fit to replace our current production cluster environments. Unfortunately, due to the timing of the problems I was experiencing, I had to decommission the allocators and re-provision for a managed ES cluster deployment. I won't be able to properly get you the diagnostic information to help us out :(

The disks I was running the allocators on were our "slow" storage: 7200rpm disks in a RAID-6-like configuration.

Thanks again for your help, I hope to get more bandwidth soon to test ECE out again, sorry for the disappointing troubleshooting session.

@IanGabes

Not at all - thanks for all the help you did provide, and sorry you ran into issues.

I think we ended up deciding internally that your issue was probably due to co-locating the director role (ie Zookeeper) and other bandwidth-intensive services (ie Elasticsearch) - we're updating our recommended architecture to reflect this ... we were able to reproduce similar issues on (slower) spinning disks (or IOPS-limited SSDs).

Please do reach out here or via any Elastic rep/contacts you might have next time you have a chance to play with ECE!

Alex
