High CPU (cgroup) usage/utilization

Hi.

We have only one node in our cluster (hosted on Elastic Cloud, v7.10.0) with Kibana (receiving Filebeat and Metricbeat data), and everything was working fine until about a week ago. I'm pretty new to all this, but I've set up alerts for high CPU usage and they've been tripping since last week and I have no idea why. The load on our instance shouldn't have changed, since the amount of data coming in is quite constant and hasn't changed in at least a month. I've also checked hot threads and long-running tasks, but there's nothing unusual there either. The only thing that stands out is that Cgroup CPU utilization is above the 85% threshold some of the time, and even those patterns are odd.

Here are some graphs for the last 24 hours:


As can be seen in the graphs above, Cgroup CPU utilization jumped up yesterday at around 21:30 and has stayed up since. However, the non-cgroup CPU utilization seems fine (it doesn't exceed 6%).

And a close-up graph for the last 2 hours:

I don't see Kibana being the reason for this, as can be seen here (last 24 hours):

Kibana for the last 2 hours:

Here are also some other performance graphs (for the last 24 hours):

And our instances' health status:

I've already restarted the cluster, to no avail. I've even stopped shipping logs (to the same instance, since we only have one), which seemed to help for a day or so, but now the high CPU usage is back.

Can anyone help me debug what's causing this? If there's a "rogue" client sending too much data or anything like that, how can I find out which client it is?
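The only way I can think of checking this myself (please correct me if there's a better way) would be to sample per-index document counts twice and see which index grows fastest; a minimal sketch of what I mean, where the endpoint and credentials are placeholders:

```python
# Sketch: sample per-index document counts twice and report the fastest-growing indices.
# ELASTIC_URL and AUTH are placeholders for your deployment endpoint and credentials.
import time
import requests

ELASTIC_URL = "https://your-deployment.es.example.com:9243"  # placeholder
AUTH = ("elastic", "your-password")                          # placeholder

def index_doc_counts():
    # _cat/indices with JSON output, only the index name and document count
    resp = requests.get(
        f"{ELASTIC_URL}/_cat/indices?format=json&h=index,docs.count",
        auth=AUTH,
    )
    resp.raise_for_status()
    return {row["index"]: int(row.get("docs.count") or 0) for row in resp.json()}

before = index_doc_counts()
time.sleep(300)  # sample again after 5 minutes
after = index_doc_counts()

growth = {name: after[name] - before.get(name, 0) for name in after}
for name, delta in sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name}: +{delta} docs in 5 minutes")
```

Of course, if all the Beats write into the same index, the growth alone wouldn't tell me which client it is; I'd probably then need something like a terms aggregation on the host field (e.g. `agent.hostname`, depending on the Beats mapping) over a recent time window.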

OK, so we have about 5 "clients" (Filebeat and Metricbeat) that are sending data to Elastic Cloud. I've turned off all of those clients, so no new data has come into the cloud for the last 4 days. And for whatever reason, I've lost all data except for the last 2 days, even though there was no activity on the node for the last 4 days (it wasn't restarted, etc.):

Not only that, something is still causing 50% CPU usage for whatever reason:

... even though really not much is happening on the instance:

... and the instance config hasn't changed for the last 13 days:

Any kind of help or insight would be much appreciated.

I have exactly the same issue... did you find anything?

I'm still debugging the issue, but what I've done so far is disable all our Metricbeat and Filebeat clients, as well as node logs and metrics. With all that disabled, the CPU utilization dropped to around 10%, which is not 0%, but some utilization is expected from a running node. I was still getting around 100 index and search requests per 5 minutes, and I have no idea where they are coming from (internal stuff?).

With just node metrics enabled, CPU utilization jumped to around 20-25% (on a cluster with just 1 node, 4 GB RAM for Elasticsearch and 2 GB RAM for Kibana), which seems very high for just collecting some metrics.
With just node logs enabled, CPU utilization seemed unaffected (it stayed at around 10%).

So I decided to enable node logs and 4 of the 6 clients (the last 2 clients, which weren't enabled yet, produce the most logs but the same amount of metrics as the others), and CPU utilization seemed unaffected (it stayed at around 10%).

Then I decided to enable the last 2 clients, so all the clients were back online, but still with no node metrics (Stack Monitoring). CPU utilization at first (in the morning, when I enabled the last clients) seemed unaffected, but at around 18:00 it gradually increased to around 20-25%. Search requests stayed at around 100 per 5 minutes, but index requests increased to around 400 per 5 minutes.

These are the current performance graphs (running everything except Stack Monitoring):

Before the upgrade to Elastic 7.10.0 we were paying $36/month (for the same 6-client setup) for one of the cheapest deployments (2 GB RAM for Elasticsearch and 1 GB for Kibana). That setup had constant memory pressure at around 75%, or sometimes even higher, and it seemed it needed more resources to ensure stable operation. That's why I doubled the resources on the Elasticsearch and Kibana side (4 GB RAM for Elasticsearch, 2 GB for Kibana), which now costs us 3x as much, at $110/month, a significant cost increase for such a small deployment.
The deployment ran fine after that (on 7.6.2), but I needed to enable stack monitoring alerts and upgrading to 7.10 seemed like a good solution. At first everything was fine after the upgrade, but after a few days or even weeks (I don't remember exactly) I started receiving high CPU utilization alerts without any change on the Elastic/Kibana side or on our clients' side. And after contacting Elastic support, they didn't know how to solve our problem except by allocating even more resources to our cluster, which could easily increase our cost by another 2x, to about $220/month, just to "fix" the problem.

The next step now is to enable additional logging on our clients, which will put more pressure on the node/cluster, and to see how CPU usage and memory pressure behave after that, before I try to (re)enable Stack Monitoring to see what happens this time.
I'm also tweaking index templates and ILM policies in the meantime to see if that has any impact on node performance.
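For reference, the kind of ILM tweak I'm experimenting with looks roughly like this; the policy name, rollover size and retention below are just example values, not our actual settings:

```python
# Sketch: create/update a simple ILM policy via the _ilm/policy API.
# Endpoint, credentials, policy name and thresholds are placeholders/examples.
import requests

ELASTIC_URL = "https://your-deployment.es.example.com:9243"  # placeholder
AUTH = ("elastic", "your-password")                          # placeholder

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # roll over the write index daily or at 5 GB, whichever comes first
                    "rollover": {"max_size": "5gb", "max_age": "1d"}
                }
            },
            "delete": {
                # delete rolled-over indices after 30 days
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

resp = requests.put(
    f"{ELASTIC_URL}/_ilm/policy/beats-30d-retention",  # example policy name
    json=policy,
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```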

Oh, and regarding only having 3 days of metrics in my previous post: it turns out that something was keeping only the last 3 daily metrics indices, but I have no idea what was deleting the older ones, since those indices weren't linked to any ILM policies.
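To try to figure out what is managing them, I'm checking what ILM reports for those indices, roughly like this (the endpoint, credentials and index pattern are assumptions on my side):

```python
# Sketch: ask ILM which policy (if any) manages the metrics indices.
# Endpoint, credentials and the index pattern are placeholders.
import requests

ELASTIC_URL = "https://your-deployment.es.example.com:9243"  # placeholder
AUTH = ("elastic", "your-password")                          # placeholder

resp = requests.get(f"{ELASTIC_URL}/metricbeat-*/_ilm/explain", auth=AUTH)
resp.raise_for_status()

for index, info in resp.json()["indices"].items():
    if info.get("managed"):
        print(f"{index}: managed by policy '{info.get('policy')}', phase={info.get('phase')}")
    else:
        print(f"{index}: not managed by any ILM policy")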

You seem to be saying that the CPU usage number is higher than some threshold, but I don't understand how that is manifesting as an actual problem. Can you explain why this matters? Why did you set up alerts on high CPU usage in the first place, and why did you choose the thresholds you chose for those alerts?

The problem with high CPU (over 100% for extended periods) is that the node's response times increase significantly. During that time the CPU load increased so much that we used up all our CPU credits and were frequently running at 90-100% CPU, which even made Kibana slow to respond and browse.

The reason for setting up monitoring alerts was simply to be alerted if anything out of the ordinary happens with the node/cluster (which I was soon notified of). I didn't set up just the high CPU usage alerts, but others as well (high disk usage, etc.). Also, I didn't choose any specific thresholds for the alerts, just what seemed like sensible general ones to me (over 90% CPU and/or disk usage, missing monitoring data, ...).

Would you be able to share a diagnostics dump or at least tasks and hot threads from a period where the cluster is responding slowly?
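In case it is useful, this is roughly what I mean by capturing those; a minimal sketch, where the endpoint and credentials are placeholders (curl or the Kibana Dev Tools console would work just as well):

```python
# Sketch: dump hot threads and the current task list to local files for sharing.
# Endpoint and credentials are placeholders.
import requests

ELASTIC_URL = "https://your-deployment.es.example.com:9243"  # placeholder
AUTH = ("elastic", "your-password")                          # placeholder

# Hot threads is plain text; the task list is JSON.
hot_threads = requests.get(f"{ELASTIC_URL}/_nodes/hot_threads?threads=10", auth=AUTH)
tasks = requests.get(f"{ELASTIC_URL}/_tasks?detailed=true", auth=AUTH)
hot_threads.raise_for_status()
tasks.raise_for_status()

with open("hot_threads.txt", "w") as f:
    f.write(hot_threads.text)
with open("tasks.json", "w") as f:
    f.write(tasks.text)
```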

Also, approximately how much data are you ingesting into the cluster? (X GB/hour or similar).

It seems like you are aware of CPU credits. I wonder if the workload was different in the past, such that you did not use up all the credits?

Unfortunately I don't have any such data anymore. I'd have to reproduce the case (probably with the same setup, just by enabling Stack Monitoring), but I'm reluctant to do that since I'm not sure what the outcome will be and how it would impact the stability of the node.
However, I did check hot threads and long-running tasks back then, and there were no obvious problems/culprits there.

I don't know how to measure the amount of data we're ingesting into the cluster, since I don't know of any such API call (can you help me with that?), and I also don't want to turn on instance monitoring because I'm afraid I'll overload the instance again. I can only give you the general performance and some log screenshots.
The current setup is that all 6 of our clients are sending logs to our node (4 of them are now sending additional, constant logs, whereas before they barely sent any), and CPU utilization is mostly around 30%, but it keeps spiking over 100% practically every hour (and I don't know what's causing it or how to find that out).
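To try to catch what is running during those hourly spikes, I'm considering a small watcher along these lines, which polls node CPU and dumps hot threads when it crosses a threshold (the endpoint, credentials and threshold are placeholders):

```python
# Sketch: poll process CPU via _nodes/stats and capture hot threads whenever it
# crosses a threshold. Endpoint, credentials and threshold are placeholders.
import time
import requests

ELASTIC_URL = "https://your-deployment.es.example.com:9243"  # placeholder
AUTH = ("elastic", "your-password")                          # placeholder
CPU_THRESHOLD = 80                                           # percent, example value

while True:
    stats = requests.get(f"{ELASTIC_URL}/_nodes/stats/process", auth=AUTH).json()
    for node_id, node in stats["nodes"].items():
        cpu = node["process"]["cpu"]["percent"]
        if cpu >= CPU_THRESHOLD:
            hot = requests.get(f"{ELASTIC_URL}/_nodes/hot_threads?threads=10", auth=AUTH)
            fname = f"hot_threads_{node.get('name', node_id)}_{int(time.time())}.txt"
            with open(fname, "w") as f:
                f.write(hot.text)
            print(f"{node.get('name', node_id)}: CPU {cpu}% >= {CPU_THRESHOLD}%, dumped {fname}")
    time.sleep(60)  # check once a minute
```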

Performance for the last 24 hours:

Logs metrics for the last 7 days:

I'm not sure what happened to the CPU workload, but before the upgrade it wasn't so high that we'd need CPU credits for it. Then, some time after the upgrade to v7.10.0, CPU usage got so high that we used up the CPU credits in a few hours and couldn't earn them back, since our CPU usage stayed constantly too high.

One significant change between 7.6 and 7.10 is the switch to the G1 garbage collector. We have made some as-yet-unreleased improvements there, but those are mostly in the 10% range. It could be interesting to see if we can correlate the spikes to GC. The Elasticsearch log should indicate if a lot of time is being spent on GC.
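A quick way to check is to look for the GC "overhead" messages in the Elasticsearch log; a rough sketch (the log path is a placeholder, and the exact message format may vary a bit between versions):

```python
# Sketch: scan an Elasticsearch log file for JvmGcMonitorService "overhead" lines.
# The log path is a placeholder; on Elastic Cloud you would download the logs first.
import re
import sys

LOG_FILE = sys.argv[1] if len(sys.argv) > 1 else "elasticsearch.log"  # placeholder path

# Lines typically look like:
# ... [gc][12345] overhead, spent [1.2s] collecting in the last [2s]
pattern = re.compile(r"\[gc\]\[\d+\] overhead, spent \[.+?\] collecting in the last \[.+?\]")

with open(LOG_FILE) as f:
    for line in f:
        if pattern.search(line):
            print(line.rstrip())
```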

The amount of data indexed can be found in a few ways; one way is to call _stats twice with 1 minute in between, and then we can diff the output.
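Something along these lines would do it (endpoint and credentials are placeholders; note that the store-size delta can fluctuate because of segment merges):

```python
# Sketch: call _stats twice, one minute apart, and diff the totals to estimate
# indexing rate and on-disk growth. Endpoint and credentials are placeholders.
import time
import requests

ELASTIC_URL = "https://your-deployment.es.example.com:9243"  # placeholder
AUTH = ("elastic", "your-password")                          # placeholder

def totals():
    resp = requests.get(f"{ELASTIC_URL}/_stats/indexing,store", auth=AUTH)
    resp.raise_for_status()
    total = resp.json()["_all"]["total"]
    return total["indexing"]["index_total"], total["store"]["size_in_bytes"]

docs1, bytes1 = totals()
time.sleep(60)
docs2, bytes2 = totals()

print(f"~{docs2 - docs1} documents indexed per minute")
# Store growth is only approximate; merges can temporarily shrink or grow it.
print(f"~{(bytes2 - bytes1) / 1024 / 1024:.1f} MB store growth per minute")
```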

Also, seeing tasks and hot threads from your current usage might reveal something; I'd be happy to take a look even if it is not from the overload situation.

I've captured the _stats output as suggested (about a minute apart). I've also included the hot_threads and _tasks output in the logs at this link: https://drive.google.com/file/d/1UDrOOy6Bj1IPnserkJNGUiY-A9whHkfZ

As for the GC spikes: I can't really check that, because some of the Performance graphs stopped working after a few days and I have no idea why. I've enabled the ML and advanced search instances, which added at least one more instance to the Performance graphs.


Hm, the graphs are back now (and have been for a week), even though I didn't change any configuration or load on the server or client side.

I did some more research and may have found the reason for the CPU spikes. It seems one of our clients is sending a lot of logs in a short time. I've taken action to reduce the excessive logging and will see if that helps. If it does, I might also turn Stack Monitoring back on to see how that affects CPU utilization.

Thanks for that extra info. I did have a look at the files you sent earlier, but they did not show anything interesting since the cluster was doing nearly nothing at the time.

Looking forward to hearing whether this behaves well without the excessive logging.