Consistently high CPU usage on hot nodes

Hi all

We have been seeing consistently high CPU usage (around 80% on average) on our hot nodes (the cluster has 2 hot, 2 warm, and 4 cold nodes) for around three weeks. The CPU usage roughly doubled one day and has not gone down since. So far, we have failed to identify a clear culprit.

We enabled logsdb on many of our indices around that time, so one idea is that the hot nodes are busier indexing the incoming data, though an increase of around 100% does not match the 10-20% overhead that Elastic mentions.

I'm basically asking for ideas on how you would diagnose such an increase in CPU usage. As I said, I'm not aware of any fundamental changes. I'm also reluctant to restart the service on one hot node, since the other one might not be able to handle the load alone.

We'll open up a support case in parallel, but the input here is often just as good :slight_smile:

What are the specs of your hot nodes? Number of CPUs, RAM, configured heap, disk type.
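For reference (these are just the columns I usually look at), something like this gives a quick overview per node:

    GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent,ram.percent,disk.used_percent&s=name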

Also, which version are you using? The 10-20% penalty is only mentioned in the documentation for version 9.x, which also says:

The exact impact varies by data set and Elasticsearch version.

So depending on the specs of your nodes and your stack version, it can have a bigger impact.

Another question: did you change index.refresh_interval on your indices? In my experience the default value of 1s can be a performance killer (I'm not sure why Elastic still uses it as the default). In my cloud deployments I change it to at least 15s on all indices, and some have a bigger refresh interval, like 30s or even 60s.
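For example (the index pattern and interval here are just illustrative), the setting can be changed on existing indices like this:

    PUT logs-*/_settings
    {
      "index": {
        "refresh_interval": "30s"
      }
    }

For data streams you would also want to set it in the index template so that new backing indices pick it up after the next rollover.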

Hi there and thank you for the quick answer

We are on 8.17.4 and it's a self-hosted cluster.

Hot nodes (2x): 16 vCPUs, 36 GB RAM, 16 GB heap, SSD disks
Warm nodes (2x) and cold nodes (4x): 8 vCPUs, 16 GB RAM, 8 GB heap, HDD disks

We haven't touched index.refresh_interval (before or after the CPU usage increase), so I guess it defaults to 1s for all indices.
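To double-check, I believe the effective value (including defaults) can be queried like this (the index pattern is just an example):

    GET logs-*/_settings?include_defaults=true&filter_path=**.refresh_interval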

The JVM usage on the two hot nodes also increased a bit, from around 55% to 65%.

In addition to what @leandrojmp said, the actual logs themselves can make a difference… (not often, but they can)

What was the CPU before / what is the increase?

What kind of data / logs?

What is the volume, avg doc size?

What are the mappings / how complex are the documents?

There are cases where logsdb may take more resources….

You can certainly turn it off, roll over, and see if the CPU reverts.
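As a rough sketch (the template and data stream names here are made up), that would mean setting the mode in the template and then rolling the data stream over:

    # In practice, update your existing index template and keep its mappings etc.;
    # this only shows the relevant setting.
    PUT _index_template/logs-myapp-template
    {
      "index_patterns": ["logs-myapp-*"],
      "data_stream": {},
      "template": {
        "settings": {
          "index.mode": "standard"
        }
      }
    }

    # Force a rollover so the next backing index is created with the new mode
    POST logs-myapp-default/_rollover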

I was unsure whether I could just adapt the index template and roll over. So there is no issue with a data stream having indices of different types, i.e. time-series and logsdb?

I just want to avoid creating an additional problem.

Good question, I should have said that.

Mixed modes in a data stream are fine… (I have tested this myself)

I mean, you already rolled it over at some point to go from standard to logsdb… but yes, there should be no issues with search / data viz / alerts, etc.

Sounds like you have a subscription… did you check the mappings / settings to see if you are actually getting synthetic source?

        "source": {
            "mode": "SYNTHETIC"
          }
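For example, a quick way to check across indices (the pattern is just an example):

    GET logs-*/_settings?filter_path=**.index.mode,**.source.mode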

Yes, some of our indices have index.mode = logsdb and source.mode = synthetic.

You are right that the index mode change had to happen at some point, but I'm now trying to put the somewhat scattered pieces back together :slight_smile:

I need to check whether the amount of ingested logs increased after all, due to changes to the Elastic Defend integration policy. I'll look into that and then update the ticket. More incoming logs would be an obvious explanation for higher CPU usage.
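As a rough check (the index pattern is an assumption about the Elastic Defend data streams), comparing backing indices by creation date should show whether the daily volume changed:

    GET _cat/indices/.ds-logs-endpoint*?v&h=index,creation.date.string,docs.count,store.size&s=creation.date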


There are a few different things that we still have to follow up on. If we reach a clear verdict, I will update the ticket and share our insights.

Hi all - we haven’t found the root cause, but we can now safely say that logsdb is NOT the issue. We set all logsdb index templates back to index mode: “standard” and waited a few weeks. While the log data on our hot nodes grew in size (as expected due to the reduced compression) by roughly 50%, the CPU usage didn’t increase or decrease noticeably.

Well, the search continues :slight_smile:

What does the hot threads API show for the nodes with high CPU usage? Is there anything unexpected that stands out there?
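For reference, something like this (the parameters are just what I would start with) dumps the busiest threads per node:

    GET _nodes/hot_threads?threads=5&interval=500ms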


How are you indexing your data? Are you using Elastic Agent integrations? If so, which integrations?

Also, did you change the refresh_interval from the default of 1s?

@maario Good to hear you eliminated logsdb.

So now, back to basics:

  • Have you looked at hot threads, as @Christian_Dahlqvist suggested? That's the first step.
  • Do you have monitoring in place?
  • @leandrojmp is suggesting you look at the ingest pipelines; there is a dashboard for that in the built-in Elastic Agent monitoring with per-pipeline details (a bad ingest pipeline can soak up CPU).
  • What does the thread pool look like? GET _cat/thread_pool (example below)
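For the thread pools, a filtered view like this (the pools and columns are just a suggestion) makes rejections easier to spot:

    GET _cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected,completed&s=node_name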

Yeah, I've had a similar issue on an Elastic Cloud deployment where the ingest pipelines for Fortigate and Palo Alto were impacting the CPU usage on hot nodes.

In my case the only solution was to offload this to Logstash using the elastic_integration filter; after that, both the CPU usage and the ingest delay we had were fixed.

But this requires an Enterprise License.
