Hi,
After installing ECE beta2 and adding servers to the configuration, I start to get timeouts on the Regions -> Runners tab saying Fetching runners failed. Initially they come in, then every other refresh of the page displays the runners. Now after a day of running, no runners are consistently shown, with the orange box "Fetching runners failed".
Before we look into this further: Have you tried simply reloading the Cloud UI in your browser? (I wasn't sure if that is what you meant by "refresh"? There's no refresh option on the Runners tab, but it's possible you meant switching to another tab and then back to the Runner tab.) The only time when I have seen this behaviour the login had expired. Reloading the page takes you to the login screen and then the Cloud UI should behave as expected.
If this doesn't address your issue, I can ask one of the UI developers to take a look at this thread.
Reloading the browser doesn't seem to help or relogging in either.
I believe there is some external call on the piece of the page that displays the runners. I can see this refresh because it's slow. That is the runners shown disappears and then comes back. After I have first built the ECE system with the initial coordinator this isn't really a problem. When I add more nodes, you start noticing the refreshing issue. And some of the refreshes start to fail.
I'll be out of the office the rest of this week and won't return until Thursday.
Thanks,
Tim
Can you bring up your browser's network activity window and see what the request is that is taking the time?
The information displayed by the runners page should be backed by the 4GB ES cluster that is created when the build occurs, so issues like this can occur if it becomes very overloaded or is otherwise distressed
Many thanks for any time you can spend investigating this
If you get an internal service error, there should be a corresponding error in /mnt/data/elastic/192.168.44.10/services/admin-console/logs/adminconsole.log (your allocator IP might be different to 192.168.44.10)
I'm guessing it will prove to be a timeout and caused by the "genesis node" being overloaded. In general for production use we recommend moving all clusters off the co-ordinator nodes onto allocator nodes. You could also try increasing the capacity of the admin cluster (again, asssuming the error is a timeout)
It's strange though, I wouldn't expect the activity to be so high so quickly. What sort of specs are the machines?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.