Fetching runners failed

Hi,
After installing ECE beta2 and adding servers to the configuration, I started getting timeouts on the Regions -> Runners tab saying "Fetching runners failed". Initially the runners loaded, then only every other refresh of the page displayed them. Now, after a day of running, the runners consistently fail to show and I only get the orange box "Fetching runners failed".

I'm running this at AWS on EC2.

Thanks,
Tim

Tim,

Before we look into this further: have you tried simply reloading the Cloud UI in your browser? (I wasn't sure whether that is what you meant by "refresh". There's no refresh option on the Runners tab, but it's possible you meant switching to another tab and then back to the Runners tab.) The only time I have seen this behaviour, the login had expired. Reloading the page takes you to the login screen, and after logging in again the Cloud UI should behave as expected.

If this doesn't address your issue, I can ask one of the UI developers to take a look at this thread.

Nik

Hi Nik,

Neither reloading the browser nor logging in again seems to help.
I believe the part of the page that displays the runners makes some external call. I can see it refresh because it's slow: the runners shown disappear and then come back. Right after I first build the ECE system with the initial coordinator, this isn't really a problem. When I add more nodes, the refreshing becomes noticeable, and some of the refreshes start to fail.

I'll be out of the office the rest of this week and won't return until Thursday.
Thanks,
Tim

Thank you for clarifying, Tim! I'll ask someone from our UI team to take a look at this thread.

Nik

Hi @tarp

Sorry that you are running into these issues.

Can you bring up your browser's network activity window and see which request is taking the time?

The information displayed by the runners page should be backed by the 4GB ES cluster that is created when the build occurs, so issues like this can occur if that cluster becomes very overloaded or is otherwise distressed.

Many thanks for any time you can spend investigating this.

Alex

I initially started out with 3 directors in my setup. The first of these is assigned every role. When I then assign the director and coordinator roles to the other 2 director instances, the problem starts. In my browser I'm seeing this in my console output:
bundle.89b36cce7c06795e7969.js:6 POST https://elasticcloud.devl.us.i01.c01.johndeerecloud.com:12443/api/v0.1/elasticsearch/allocators-ece-region/_search 500 (Internal Server Error)

Sometimes it returns OK and sometimes it errors out like above.

--Tim
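
To put a number on "sometimes OK, sometimes 500", the same request can be repeated from a script and the status codes tallied. This is only a rough sketch: the URL is the one from the console output above, but the empty JSON body and the `headers` argument are assumptions, so the real request body and session headers would need to be copied from the browser's network tab. The helper names `post_search` and `tally_statuses` are made up for illustration.

```python
import json
import urllib.error
import urllib.request
from collections import Counter

# URL taken from the console output above; adjust for your installation.
SEARCH_URL = (
    "https://elasticcloud.devl.us.i01.c01.johndeerecloud.com:12443"
    "/api/v0.1/elasticsearch/allocators-ece-region/_search"
)


def post_search(url, headers=None, timeout=30):
    """POST a JSON body to url and return the HTTP status (0 if no response)."""
    # Assumption: an empty query; replace with the body the Cloud UI sends.
    body = json.dumps({}).encode("utf-8")
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    request = urllib.request.Request(
        url, data=body, headers=all_headers, method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; keep the status code.
        return err.code
    except OSError:
        # Connection refused, DNS failure, or client-side timeout.
        return 0


def tally_statuses(send, attempts=20):
    """Call send() repeatedly and count how often each status comes back."""
    return Counter(send() for _ in range(attempts))
```

Something like `tally_statuses(lambda: post_search(SEARCH_URL, my_headers))` would then show how the 200 and 500 responses are split, which makes it easier to tell whether the failure rate changes after adding nodes.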

@tarp

Apologies, I only just noticed this.

If you get an internal server error, there should be a corresponding error in /mnt/data/elastic/192.168.44.10/services/admin-console/logs/adminconsole.log (your allocator IP might differ from 192.168.44.10).
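
If the log is large, a short script can pull out the lines worth reading. This is a minimal sketch, assuming the log is plain text and that relevant entries mention an error level, an exception, or a timeout; the actual log format is not confirmed here, so the pattern may need adjusting. The function names are hypothetical.

```python
import re

# Path from the reply above; the IP directory differs per host.
LOG_PATH = (
    "/mnt/data/elastic/192.168.44.10/services/admin-console/logs/adminconsole.log"
)

# Assumption: interesting lines mention ERROR, an exception, or a timeout.
PATTERN = re.compile(r"ERROR|Exception|timed? ?out", re.IGNORECASE)


def find_suspect_lines(lines, pattern=PATTERN):
    """Return (line_number, text) pairs for lines matching the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_log(path=LOG_PATH):
    """Open the log file and return its suspect lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_suspect_lines(handle)
```

Comparing the timestamps of the matched lines with the time of a failed `_search` request in the browser should show whether the 500 corresponds to a timeout.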

I'm guessing it will prove to be a timeout caused by the "genesis node" being overloaded. In general, for production use we recommend moving all clusters off the coordinator nodes and onto allocator nodes. You could also try increasing the capacity of the admin cluster (again, assuming the error is a timeout).

It's strange though, I wouldn't expect the activity to be so high so quickly. What sort of specs are the machines?

Alex

It turned out that I didn't have a port open on my initial server used to build ECE.
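
For anyone hitting the same symptom, a quick TCP check run from each of the other hosts can confirm whether the first server is reachable on the ports ECE needs. This is a minimal sketch: the host and port list are placeholders to fill in from the ECE networking prerequisites and your own security group rules, and the function names are made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def closed_ports(host, ports, timeout=3.0):
    """Return the ports on host that did not accept a connection."""
    return [port for port in ports if not port_is_open(host, port, timeout)]
```

For example, `closed_ports("10.0.0.5", [12443])` (with a hypothetical address for the first server) returns an empty list when the port is reachable and `[12443]` when it is blocked.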