I have installed ECE on a few machines and have set up 2 clusters, each with a separate kibana instance. After about a week of normal operation, I attempted to access one of the kibana dashboards, only to receive an "internal server error" message, much like this issue. All the kibana instances across my clusters had this problem.
I then attempted to restart the kibana instance, which didn't work. I then attempted to delete and recreate the kibana instance, which changed the error message to "unable to connect to server".
I ran some of the commands requested for diagnosis in that issue and have put them here:
To be clear, the ES endpoints are still functional.
@IanGabes Thanks for the attached diagnostic ... (UPDATE: this isn't actually your issue; I'm leaving this up as a useful reference for people who do run out of space.) The problem is that you have run out of disk space on your admin console cluster, causing queries to that cluster (which the UI depends on) to fail.
If you have space on the allocator, you can use the API to up the capacity:
1. GET /clusters/elasticsearch/ID/plan/current
2. save the json body in a text file
3. edit it to double the capacity
4. POST that to /clusters/elasticsearch/ID/plan
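For concreteness, here's roughly what those steps look like as curl calls. This is only a sketch: the host, port, /api/v1 prefix, credentials, and the exact capacity field are assumptions, so adjust them (and CLUSTER_ID) for your installation and ECE version.

```
# Sketch only -- COORDINATOR, port, credentials, and the API prefix are assumptions.

# 1. fetch the current plan and save it
curl -k -u admin:PASSWORD \
  "https://COORDINATOR:12443/api/v1/clusters/elasticsearch/CLUSTER_ID/plan/current" \
  -o plan.json

# 2./3. edit plan.json and double the node capacity
#       (check your saved plan for the exact field name and units)

# 4. post the edited plan back to apply it
curl -k -u admin:PASSWORD -X POST -H "Content-Type: application/json" \
  --data @plan.json \
  "https://COORDINATOR:12443/api/v1/clusters/elasticsearch/CLUSTER_ID/plan"
```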
If you don't have capacity there are a few options:
- you can edit the "secret advanced" metadata to up the disk:ram ratio; there's another Discuss post in which I describe how to do this. I'm on my phone, so I'll link it in a separate post.
- you can shut some other clusters down first by POSTing to /clusters/elasticsearch/ID/_shutdown (sketched below) .. note this deletes all the data forever unless you have snapshots enabled
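A sketch of that shutdown call from option 2, with the same assumptions about host, port, and credentials as the first sketch:

```
# Sketch only -- again assuming the admin API at https://COORDINATOR:12443/api/v1.
# WARNING: this deletes the cluster's data unless you have snapshots enabled.
curl -k -u admin:PASSWORD -X POST \
  "https://COORDINATOR:12443/api/v1/clusters/elasticsearch/CLUSTER_ID/_shutdown"
```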
Can you clarify - you are hitting the kibana endpoint directly (ie over the ECE proxy), not via the ECE UI, and it returned first a 500 and then (after you deleted and recreated the kibana via the ECE UI) failed to connect altogether?
Can you look at the proxy logs and see if you are getting errors? Eg /mnt/data/elastic/RUNNERIP/services/proxy/logs/proxy.log
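Something like this should be enough to pull out the interesting lines (substitute your runner's IP into the path above; adjust the root if you installed elsewhere):

```
# Scan the proxy log for recent errors/warnings -- RUNNER_IP is the runner's IP
# as it appears under /mnt/data/elastic; adjust if your install root differs.
grep -iE "error|warn" /mnt/data/elastic/RUNNER_IP/services/proxy/logs/proxy.log | tail -n 50
```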
It's very strange that kibana isn't working but elasticsearch is, though; the proxy treats them identically. When you restarted/stopped/recreated the kibana, did you get any errors reported?
I would check that you have disk space available on all the hosts.
The issue definitely feels like something is running out of disk (either the host, or some of the kibana containers, or both)
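For example, something like the following (just a sketch; the mount point and container names will vary on your hosts):

```
# Free space on the ECE data mount on each host (adjust the mount point as needed)
df -h /mnt/data

# List the kibana containers, then check disk usage inside one of them
# (the container name below is illustrative)
docker ps --format '{{.Names}}' | grep -i kibana
docker exec SOME_KIBANA_CONTAINER df -h
```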
My ECE UI still works without problems. (I guess my eyes lit up when I saw the 500 error in the title.)
I am hitting the kibana proxy endpoint directly. I was first given a 500. After a restart and then a delete, it returned a "could not find cluster". I have been able to log to ES continually, both before and after working on these problems.
I received no errors during the recreation of the kibana instance.
I have 2TB of disk space left across my allocators.
My proxy logs look like this, which looks okay to my untrained eye:
I have gone ahead and reshuffled some resources around to give the admin cluster room to breathe.
Also, I noticed some errors on the proxy page of the ECE UI, but this just looks like some:
Here is an excerpt from the json response (when I query the below endpoint manually):
Now that I think we've found the problem, I am not sure how to fix it :( This doesn't make much sense to me, because the ES endpoints continue to function:
Sorry you're having problems. Thanks again for the detailed information you shared, which has really helped our diagnosis.
I grabbed a few people to discuss this case last night. It sounds a bit like an issue we've just started tracking, where the applications lose their connection to Zookeeper (the database of record).
Could you check the client-forwarder and zookeeper logs (these might not be exactly right, apologies; I'm not in the office) and look for any interesting timeouts or error messages?
Can you share a bit more about your architecture, eg what OS and kernel versions you are running, what the underlying hardware is, how many zookeeper processes (director roles in the UI) you are running, and whether those are on the allocators or on standalone machines?
In terms of workarounds, it should be possible to fix this by restarting the machine running the proxy (I believe the underlying issue is in the client-forwarder service, but restarting just that service may also require restarting other services).
I am running allocators on three virtual hosts, each with 128GB of RAM, on the same network. We are running Ubuntu 16.04, kernel version 4.4.0-96, and Docker version 1.12. I have 4TB of disk space shared between the hosts.
Here are the client forwarder logs, with a cryptic "unexpected error":
We have been looking into this more (thanks again for the logs). Can you share any information about your disks, and also maybe the IO that tools like iostat and docker stats are reporting?
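For example, something along these lines on each allocator (the zookeeper log location is an assumption based on the proxy log path above, so adjust it to wherever zookeeper actually logs on your hosts):

```
# Per-device IO utilisation, three 5-second samples (from the sysstat package)
iostat -x 5 3

# One-shot per-container CPU / memory / block IO figures
docker stats --no-stream

# Look for slow-sync warnings in the zookeeper logs -- the path is a guess based
# on the layout of the proxy logs
grep -i "fsync" /mnt/data/elastic/*/services/zookeeper/logs/*.log | tail -n 20
```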
Our current working theory is something like:
- The system IO is overloaded (see the zookeeper logs for warnings like "sync took 2500ms"). This might be because of the co-located elasticsearch nodes, or because the disk is underpowered.
- As a result, zookeeper on the overloaded nodes isn't able to respond to clients on the same node (like the proxy) within the 10s timeout.
- Other services (like the constructor) are connecting to different zookeeper nodes and hence are able to continue functioning.
Our recommended architecture is to run zookeeper in production on nodes separate from the allocators, for exactly this sort of reason. While evaluating, it is OK to run them on the same nodes, provided there is sufficient IO.
Alex, sorry for the late reply. We had set up ECE as a trial run for an event we were running, as a bit of a testing ground to see if ECE would be a good fit to replace our current production cluster environments. Unfortunately, due to the timing of the problems I was experiencing, I had to decommission the allocators and re-provision for a managed ES cluster deployment. I won't be able to properly get you the diagnostic information to help us out :(
The disks I was running the allocators on were our "slow" storage: 7200rpm disks in a RAID-6-like configuration.
Thanks again for your help, I hope to get more bandwidth soon to test ECE out again, sorry for the disappointing troubleshooting session.
Not at all - thanks for all the help you did provide, and sorry you ran into issues.
I think we ended up deciding internally that your issue was probably due to co-locating the director role (ie Zookeeper) and other bandwidth intensive services (ie Elasticsearch) - we're updating our recommended architecture to reflect this ... we were able to reproduce similar issues on (slower) spinning disks (or IOPS-limited SSDs)
Please do reach out here or via any Elastic rep/contacts you might have next time you have a chance to play with ECE!