Where should I move proxy nodes in ECE?

I just noticed Kibana was kind of slow in my current deployment, and I wasn't sure why. Maybe I should have moved the proxy nodes from the coordinators to the allocators. Any thoughts?

If I don't want to deploy dedicated nodes for proxies, where should I keep the proxies in the ECE deployment? Should I move them to the allocators, or should they stay where they are (on the coordinators)?

I was under the impression that nothing should be moved onto the allocators, for security reasons.

--Thanks

We do recommend moving proxy nodes off allocators where possible, either onto the coordinator nodes or onto their own dedicated nodes. This is for both security/isolation and performance reasons. (See https://www.elastic.co/guide/en/cloud-enterprise/current/ece-playbook.html)

That said, many smaller ECE deployments do co-locate proxies on allocators.

It is unlikely that this would be the root cause of a slow Kibana unless there was some anomalous traffic flow (e.g. large numbers of requests that the proxy was rejecting for some reason). You can use docker stats to confirm that the proxy is not CPU-limiting the allocator box; I'd be somewhat surprised if that was the case.
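If you'd rather script the docker stats check than eyeball it, here is a minimal Python sketch. It assumes docker is on the PATH of the host you run it on and uses the standard `--no-stream` and `--format` flags; the helper names (`parse_cpu`, `busy_containers`, `sample_host`) and the 80% threshold are just illustrative choices, not anything ECE ships.

```python
import subprocess

# --no-stream prints a single sample; --format takes a Go template.
DOCKER_STATS_CMD = [
    "docker", "stats", "--no-stream",
    "--format", "{{.Name}}\t{{.CPUPerc}}",
]


def parse_cpu(stats_output):
    """Map container name -> CPU percent from 'name<TAB>12.34%' lines."""
    usage = {}
    for line in stats_output.splitlines():
        if not line.strip():
            continue
        name, cpu = line.rsplit("\t", 1)
        usage[name.strip()] = float(cpu.strip().rstrip("%"))
    return usage


def busy_containers(usage, threshold=80.0):
    """Names of containers at or above the threshold, busiest first."""
    hot = [name for name, cpu in usage.items() if cpu >= threshold]
    return sorted(hot, key=lambda name: usage[name], reverse=True)


def sample_host():
    """Take one live sample on this host (needs a running docker daemon)."""
    result = subprocess.run(
        DOCKER_STATS_CMD, capture_output=True, text=True, check=True
    )
    return parse_cpu(result.stdout)
```

Calling `busy_containers(sample_host())` on the host lists whichever containers are hot at that moment. As far as I know, on Linux the CPU figure from docker stats is scaled so that 100% is one full core, so compare it against the host's core count before concluding the whole box is saturated.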

Have you looked at the monitoring information (either what ECE generates natively or ideally with a separate monitoring cluster)? Most of the time a slow Kibana is caused by a slow Elasticsearch.

I haven't moved the proxies to the allocators (they are where the installation process put them). I was just thinking about it, since the allocators are much more powerful than the coordinators in my deployment (and I assume in most others as well), but again I had concerns that this wasn't recommended by Elastic.

I checked all the monitoring information dashboards and logs, and everything was green. The usage was super low, no obvious performance issues were showing.

The issue I noticed is that sometimes (when you run through the tabs on the left side menu) Kibana shows the first few tabs right away, and then it takes about 5-10 seconds to show the next one you clicked on. And this is constantly happening. Or sometimes when you refresh, it takes about 5 seconds to do it. The other times, it's right away.

@Alex_Piggott
I ran the docker stats command and noticed frc-proxies-proxy sits at about 10-20% CPU on average, but when I keep going through the tabs in Kibana and get the 'spinning' Kibana logo, the CPU spikes to 80-90%. The server is pretty much idle, with just some test data, and it has 8 CPUs and 32 GB RAM.

Also, after upgrading ECE, I can see some notification about proxies running on coordinators now.

Runner x.x.x.x is both a coordinator and a proxy

From your documentation:

Roles that should not be held by the same runner:

  • Allocators and coordinators
  • Allocators and directors
  • Coordinators and proxies
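The list above can be turned into a quick check. This is only a sketch: the role names are plain strings I made up for illustration (not identifiers from the ECE API), and `role_conflicts` is a hypothetical helper.

```python
# Pairs copied from the documentation list quoted above.
CONFLICTING_ROLES = [
    ("allocator", "coordinator"),
    ("allocator", "director"),
    ("coordinator", "proxy"),
]


def role_conflicts(roles):
    """Return the discouraged role pairs that a single runner holds."""
    held = {role.lower() for role in roles}
    return [pair for pair in CONFLICTING_ROLES if held.issuperset(pair)]
```

For the runner in the notification, `role_conflicts(["coordinator", "proxy"])` flags the coordinator/proxy pair, while a runner that is both allocator and proxy comes back clean, since that pair is not in the list.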

I'd look at the proxy logs in the L+M cluster and correlate response times with when you see the slowdown. If you see long response times and no suggestion that ES is slow, then maybe a slow coordinator is acting as a bottleneck.

I'd be really surprised if that could be responsible for a 10s delay. It sounds more like an IO stall on the allocator, connectivity issues, or something like that.
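To make the correlation concrete, here is a rough Python sketch that buckets proxy log entries into one-minute windows and reports the windows containing slow requests. The field names `timestamp` and `response_time_ms` are assumptions for illustration; check what your proxy log documents actually call them. The 5000 ms default just mirrors the 5-10 second stalls described earlier.

```python
from collections import defaultdict
from datetime import datetime


def slow_windows(records, threshold_ms=5000, window_seconds=60):
    """Group records into time windows and report windows with slow requests.

    Each record is a dict with an ISO 8601 'timestamp' (with UTC offset) and
    a numeric 'response_time_ms'; both field names are hypothetical.
    Returns (window_start_epoch, total_requests, slow_requests, max_ms) tuples.
    """
    buckets = defaultdict(list)
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        start = int(ts.timestamp()) // window_seconds * window_seconds
        buckets[start].append(rec["response_time_ms"])

    report = []
    for start in sorted(buckets):
        times = buckets[start]
        slow = [t for t in times if t >= threshold_ms]
        if slow:
            report.append((start, len(times), len(slow), max(times)))
    return report
```

If the windows it returns line up with the moments Kibana spins while Elasticsearch monitoring looks healthy for the same period, that points toward the proxy or the host it runs on rather than ES.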

Runner x.x.x.x is both a coordinator and a proxy

That's our "best practice" recommendation, but lots of people run e.g. 3 or 6 host ECE deployments, and co-locating either allocator/proxy or coordinator/proxy (or both), depending on hardware, expected load, etc., is super standard.

Sorry I missed this. Can I check I understood correctly ... the CPU usage of frc-proxies-proxy goes up to 80-90% when you spin through the tabs in Kibana?!

How many cores do the coordinator hosts have? (And what sort of traffic volume do the proxy logs show when this occurs?)

@Alex_Piggott

We run the whole infra in GCP on SSD drives. I'll look into the proxy logs.

Coordinators: 3 systems x 8 vCPUs, 32 GB memory + SSD storage
Allocators: 3 systems x 32 vCPUs, 128 GB memory + SSD storage

And we're getting barely any data, we just started.

Searches return pretty quickly, but the issue we have is with going through the tabs (e.g. Discover, Visualize, etc.): they take 10s+ to load after you click on a few. Maybe that's normal behavior, and these apps just take a bit longer to load. Would that be correct?

The first time maybe? I wouldn't expect it to happen every time, and the proxy spiking in CPU is a bit odd.

The coordinator box you've spec'd looks plenty fast enough that putting the proxy either there or on the allocator shouldn't matter.

I'll keep digging around and let you know if there is anything odd anywhere in the deployment.

--Thanks!