Elastic APM shows 100% CPU usage but the cause is unclear – need help!

I am experiencing an issue with Elastic APM where the system.cpu.total.norm.pct metric for a specific service is always 1, indicating that CPU usage is constantly at 100%.

However, when I check inside the container, I do not see the same behavior.

I've configured the Java Agent in the container as follows:

java -DXms1024M -Xmx1024M
                 -XX:+UseParallelGC -Xlog:gc:logs/gc.log
                 -Djava.rmi.server.hostname=127.0.0.1
                 -Dcom.sun.management.jmxremote
                 -Dcom.sun.management.jmxremote.port=999
                 -Dcom.sun.management.jmxremote.authenticate=false
                 -Dcom.sun.management.jmxremote.ssl=false
                 -javaagent:/opt/app/bin/elastic-apm-agent-1.47.1.jar
                 -Delastic.apm.service_name={backend-service}
                 -Delastic.apm.secret_token={secret_token}
                 -Delastic.apm.server_url=https://localhost:8200
                 -Delastic.apm.transaction_sample_rate=0.1
                 -Delastic.apm.cpu_count=2

Does anyone know what could be causing this issue?

Hi!

There is probably an issue with how the metric is captured by the agent here.

Before we dig any further, there are also a few unexpected things in your configuration that would be nice to fix:

  • -Delastic.apm.cpu_count=2: I don't know this config option. Do you know where it comes from?
  • -Delastic.apm.server_url=https://localhost:8200 will make the agent send data to localhost, which means that unless an apm-server is running within the container, the data is unlikely to reach anything.
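
To illustrate the second point, here is a minimal sketch that checks whether a configured server_url resolves to a loopback address, which inside a container means the container itself. The class and method names are made up for this example and are not part of the agent.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of the Elastic APM agent.
final class ServerUrlCheck {

    // Returns true when the host of the given URL resolves to a loopback address.
    // Inside a container, such a URL targets the container itself, not the host.
    static boolean pointsToLoopback(String serverUrl) {
        String host = URI.create(serverUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + serverUrl);
        }
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve host: " + host, e);
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "https://localhost:8200";
        System.out.println(url + " -> loopback: " + pointsToLoopback(url));
    }
}
```

If this reports a loopback address for the URL used inside the container, the agent can only reach an apm-server that runs in that same container.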

Yes, I couldn't find the parameter -Delastic.apm.cpu_count=2 in the Elastic documentation either. I suspected that the Elastic Java agent had misinterpreted the number of cores on my host machine, causing the CPU percentage to always appear high. So I asked ChatGPT how to configure it :D, and it gave me that parameter. Most likely, it just made it up.

Besides that, regarding your second point, I have correctly configured the data to be sent to my Elastic APM server, as you can see in the screenshot below:

The system.cpu.total.norm.pct metric is captured from the JMX interface OperatingSystemMXBean.getSystemCpuLoad. Since you have configured remote JMX access, can you check the actual value of this MBean attribute with a tool like jconsole or visualvm?

If the value read through JMX is consistent with what the agent reports, then the problem is in the JVM. What version are you using? There have been quite a few bugs in this area, in particular when running in containers.
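
If attaching jconsole is inconvenient, the same attributes can be read from inside the JVM. The following is a minimal sketch, not the agent's own code: the class name CpuLoadProbe is made up, and it assumes the platform MBean java.lang:type=OperatingSystem exposes SystemCpuLoad and ProcessCpuLoad (as HotSpot-based JVMs do). As far as I know, the CpuLoad attribute only exists on JDK 14 and later, so the sketch prints NaN when an attribute is missing.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical diagnostic helper; not part of the Elastic APM agent.
final class CpuLoadProbe {

    // Reads a numeric attribute of the OperatingSystem platform MBean.
    // Returns NaN if the attribute is missing on this JVM or cannot be read.
    static double readAttribute(String name) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
            Object value = server.getAttribute(os, name);
            return value instanceof Number ? ((Number) value).doubleValue() : Double.NaN;
        } catch (Exception e) {
            return Double.NaN;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample a few times: a healthy value fluctuates between 0 and 1,
        // and a negative value means the JVM could not determine the load.
        for (int i = 0; i < 5; i++) {
            System.out.printf("SystemCpuLoad=%.4f ProcessCpuLoad=%.4f CpuLoad=%.4f processors=%d%n",
                    readAttribute("SystemCpuLoad"),
                    readAttribute("ProcessCpuLoad"),
                    readAttribute("CpuLoad"),
                    Runtime.getRuntime().availableProcessors());
            Thread.sleep(1000);
        }
    }
}
```

Running this inside the container with the same JVM and flags as the service shows whether the JVM itself reports a constant 1.0, independently of the agent.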

I checked the CPU usage through JConsole, and it appears to be at a normal level, which is different from what Elastic shows (always 100% CPU usage). What could be causing this issue?

Hi,

Maybe you could try updating to the latest agent version to see if that makes a difference here. If that's still an issue we will need to investigate further.

I have tried upgrading the Elastic APM agent to version 1.52.2, but the CPU usage is still always at 100%...

I see that the CpuLoad attribute of the MBean, checked from JConsole, is always at 1. May I ask whether the Elastic APM agent uses this metric?

That's definitely unexpected: the CpuLoad shown here is always exactly 1.0000, which suggests something is off, as it should fluctuate a bit rather than stay constant.

However, the agent only uses SystemCpuLoad so the CpuLoad attribute is not used.

What is the exact version and provider of the JVM here? It's still possible it could be a JVM bug, so to rule this out I would also recommend updating to the latest available version to see if that makes a difference.
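
In case it helps to collect those details, here is a small sketch that prints them from standard system properties. The class name JvmInfo is made up for this example. Running it with the same java binary as the service inside the container should give the exact version and vendor, plus the number of processors the JVM believes it has.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for gathering JVM details to include in a bug report.
final class JvmInfo {

    // Collects standard system properties describing the JVM and OS,
    // plus the processor count as seen by this JVM.
    static Map<String, String> describe() {
        Map<String, String> info = new LinkedHashMap<>();
        String[] keys = {
            "java.version", "java.vendor", "java.vm.name",
            "java.vm.version", "os.name", "os.arch"
        };
        for (String key : keys) {
            info.put(key, System.getProperty(key));
        }
        info.put("availableProcessors",
                String.valueOf(Runtime.getRuntime().availableProcessors()));
        return info;
    }

    public static void main(String[] args) {
        describe().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```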