For around three weeks we have been seeing consistently high CPU usage (around 80% on average) on the hot nodes of our cluster (2 hot, 2 warm, 4 cold). The CPU usage roughly doubled one day and has not gone down since. So far, we have failed to identify a clear culprit.
We enabled logsdb on many of our indices around that time, so one idea is that the hot nodes are busier indexing the incoming data, though an increase of around 100% does not match the 10-20% overhead that Elastic mentions.
I'm basically asking for ideas on how you would diagnose such an increase in CPU usage. As I said, I'm not aware of any fundamental changes that have happened. Also, I'm reluctant to restart the service on one of the hot nodes, since the other one might not be able to handle the load alone.
We'll open a support case in parallel, but the input here is often just as good.
What are the specs of your hot nodes? Number of CPUs, RAM, heap configured, and disk type.
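For reference, _cat/nodes can pull most of that in one call (column availability can vary slightly between versions):

```
GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.current,heap.max,ram.max,disk.total
```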
Also, what version are you using? The 10-20% penalty is mentioned only in the documentation for version 9.x, which also says:
The exact impact varies by data set and Elasticsearch version.
So depending on your node specs and stack version, the impact can be bigger.
Another question: did you change the index.refresh_interval of your indices? In my experience the default value of 1s can be a performance killer; I'm not sure why Elastic still uses it as the default. In my cloud deployments I change it to at least 15s on all indices, and some have a bigger refresh interval, like 30s or even 60s.
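As a sketch of what I mean (index and template names are placeholders), the setting can be changed on an existing index directly or, for data streams, in the index template so that new backing indices pick it up:

```
PUT my-logs-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}

PUT _index_template/my-logs-template
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.refresh_interval": "30s"
    }
  }
}
```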
I was unsure whether I could just adapt the index template and roll over. So there is no issue with a data stream having indices in both time-series and logsdb mode?
I just want to avoid creating an additional problem.
Mixed index modes in a data stream are fine … (I have tested this myself)
I mean you rolled it over at some point to go from standard to logsdb … but yes, there should be no issues with search / data viz / alerts, etc.
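For the record, the template-plus-rollover sequence looks roughly like this (template and data stream names are placeholders); only backing indices created after the rollover use the new mode:

```
PUT _index_template/logs-myapp-template
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "priority": 500,
  "template": {
    "settings": {
      "index.mode": "logsdb"
    }
  }
}

POST logs-myapp-default/_rollover
```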
Sounds like you have a subscription … did you check the mappings / settings to see if you are actually getting synthetic source?
Yes, some of our indices have index.mode = logsdb and source.mode = synthetic.
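In case it helps anyone else, this is roughly how it can be checked on the backing indices (the index pattern is a placeholder, and the exact source-mode setting name can differ between versions):

```
GET .ds-logs-*/_settings?filter_path=*.settings.index.mode,*.settings.index.mapping.source.mode
```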
You are right that the index mode change had to happen at some point; I'm now just trying to put the somewhat scattered pieces back together.
I need to check whether the amount of ingested logs increased after all due to changes to the Elastic Defend integration policy. I'll look into that and then update the ticket. More incoming logs would be an obvious explanation for more CPU usage.
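One quick way to sanity-check the volume (the index pattern is a placeholder; adjust it to your Defend data streams) is to compare document counts and sizes of the backing indices over time:

```
GET _cat/indices/.ds-logs-*?v&h=index,creation.date.string,docs.count,store.size&s=creation.date
```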
Hi all - we haven't found the root cause, but we can now safely say that logsdb is NOT the issue. We set all logsdb index templates back to index mode "standard" and waited a few weeks. While the log data on our hot nodes grew in size by roughly 50% (as expected due to the reduced compression), the CPU usage didn't increase or decrease noticeably.
@leandrojmp is suggesting to look at the ingest pipelines. There is a dashboard for that in the built-in Elastic Agent monitoring with per-pipeline details (a bad ingest pipeline can soak up CPU).
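If the Agent monitoring dashboards aren't set up, the node ingest stats give a rough per-pipeline view of invocation counts and time spent (the filter_path just trims the output):

```
GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines
```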
What do the thread pools look like? GET _cat/thread_pool
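For example, something like this (the header and sort parameters are optional) makes busy pools and rejections easier to spot:

```
GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected,completed&s=rejected:desc
```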
Yeah, I've had a similar issue on an Elastic Cloud deployment where the ingest pipelines for Fortigate and Palo Alto were impacting the CPU usage on hot nodes.
In my case the only solution was to offload this work to Logstash using the elastic_integration filter; after that, both the CPU usage and the ingest delay we had were fixed.
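For anyone curious, here is a minimal sketch of that kind of Logstash pipeline, assuming the Elastic Agents have been re-pointed at Logstash; the hosts and credentials below are placeholders:

```
input {
  # Receive events from Elastic Agents that now ship to Logstash
  elastic_agent {
    port => 5044
  }
}

filter {
  # Run the integration's ingest pipelines inside Logstash
  # instead of on the Elasticsearch hot nodes
  elastic_integration {
    hosts   => ["https://my-cluster.example.com:9200"]
    api_key => "${ES_API_KEY}"
  }
}

output {
  elasticsearch {
    hosts       => ["https://my-cluster.example.com:9200"]
    api_key     => "${ES_API_KEY}"
    data_stream => "true"
  }
}
```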