Java agent causes CPU spikes

After running more performance tests and subjecting our services to real-life traffic load profiles, we noticed that our services experience high CPU spikes with the Java agent enabled compared to our baseline with no agent. We tested with both Tomcat and Jetty and observed the same behaviour.
As an example, our baseline of 20% CPU jumps to 50% with the agent, and a baseline of 2% jumps to around 20% with the agent.

Our profiling shows that the problem may be with the apm-reporter (and not the instrumentation itself) and that the bulk of the CPU time is consumed by the LMAX Disruptor in LiteBlockingWaitStrategy.waitFor().

I've attached a screenshot of a sample profile:

Elastic APM Stack 6.5.4
Java agent 1.3.0
We tried on both Jetty and Tomcat, same results

Is there something we can do to fix this or is this expected?

We came across this, but we are not sure whether it is related to our issue or what our best course of action would be: https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/PhasedBackoffWaitStrategy.java

Hi and thanks for reporting!
The agent is indeed very lightweight on the instrumentation side, but if it sends all full traces to the APM server, this can be CPU intensive.
For high production loads, it is advisable to tune the sample rate to fit your needs.
I hope this helps,
Eyal.

Hi Eyal, we actually lowered sampling to 20% before getting the numbers above. We're not too keen to go much lower. Higher than 20% and our CPU spikes are even worse. Thanks!

What loads are we talking about? Note that even if you set sample_rate to 0.01, we still collect timings for all transactions; we just don't collect and send the full data. 0.01 means you will have full traces for 1% of transactions, which may be more than enough for high loads. If your loads are really high, I suggest you try it out.
We'd be very happy to get any numbers on which loads you are testing, roughly what the overhead is at each sample rate, and most importantly, whether working with a very low sample rate provides good results.
There are other tuning options that may be valid depending on your application; you can go through the documentation of the available configuration options, but I would exhaust the sample rate option before going there.
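
As a concrete sketch (assuming you configure the agent through an elasticapm.properties file on the classpath; the standard transaction_sample_rate option can also be set as a system property or environment variable):

```
# elasticapm.properties
# Keep timings for all transactions, but only send full traces for 1% of them.
transaction_sample_rate=0.01
```

or equivalently -Delastic.apm.transaction_sample_rate=0.01 on the JVM command line.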
I hope this makes sense and will be glad to get more input,
Eyal.

The problem does not seem to be the actual serialization and compression but rather waiting for elements to become available in the disruptor. That’s interesting :thinking:

We are actually already using the PhasedBackoffWaitStrategy. We’ll discuss and test other WaitStrategy options and get back to you.
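
For anyone following along, here is a rough standalone sketch (not the agent's actual reporter code; the event type, buffer size, and timeouts are made up for illustration) of how a Disruptor gets wired up with different wait strategies, to show where LiteBlockingWaitStrategy comes into play and what an easier-on-CPU alternative looks like:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.PhasedBackoffWaitStrategy;
import com.lmax.disruptor.WaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

public class WaitStrategyDemo {

    /** Placeholder event type; the agent's real event class is different. */
    static class ReportingEvent {
        Object payload;
    }

    public static void main(String[] args) {
        // Busy-spins, then yields, then falls back to a LiteBlockingWaitStrategy
        // once the timeouts elapse -- that fallback's waitFor() is what shows up
        // in the profile while the ring buffer is idle.
        WaitStrategy phasedBackoff =
                PhasedBackoffWaitStrategy.withLiteLock(1, 10, TimeUnit.MILLISECONDS);

        // Parks the consumer thread on a lock/condition while the ring buffer is
        // empty; generally the easiest strategy on CPU, at some latency cost.
        WaitStrategy blocking = new BlockingWaitStrategy();

        Disruptor<ReportingEvent> disruptor = new Disruptor<>(
                ReportingEvent::new,               // event factory
                1024,                              // ring buffer size (power of two)
                Executors.defaultThreadFactory(),
                ProducerType.MULTI,                // many application threads publish
                blocking);                         // swap in phasedBackoff to compare

        // Single consumer that would serialize and ship events to the APM server.
        EventHandler<ReportingEvent> handler =
                (event, sequence, endOfBatch) -> { /* serialize + send */ };
        disruptor.handleEventsWith(handler);

        disruptor.start();
    }
}
```

The trade-off is latency versus CPU: strategies that spin or yield react faster to new events but burn cycles while the ring buffer is idle, whereas blocking strategies let the consumer thread sleep.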

I’d also be interested in the load in that scenario.

We'll continue running the load tests with even lower sampling rates to see the impact.

From what I remember, we have around 150 calls a minute on that service, so we should theoretically be reporting on just 30 calls a minute on average, one every two seconds, which should be low. I'll get more accurate numbers when I get back to the office tomorrow.

As a point of comparison, we did not see similar CPU loads when we were using a different agent (AppDynamics), and that was at 100% sampling.

Many thanks, Eyal and Felix.

Again, I don't think sampling is the problem here. Using a wait strategy which is easier on CPU should resolve that issue. I have created an issue for this: https://github.com/elastic/apm-agent-java/issues/443

Which exact Java version are you using and what's your OS?

Hi Felix, Java and OS details are as follows:

java version "1.8.0_74"
Java(TM) SE Runtime Environment (build 1.8.0_74-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)

CentOS release 6.10 (Final)

Some additional numbers just to provide context: for the services that saw CPU use go from 20% to 50%, traffic was hitting about 4,000 calls per minute sustained, spread over four nodes. There was an hour-long ramp-up from 0 to 4,000 cpm, a sustained 4,000 cpm for two hours, then a slow ramp-down to 0 over another hour. Traffic was evenly split, so the average sustained load is 1,000 cpm per node.

With the agent on, each node's CPU behaviour was a slow rise from 0% to 50%, sustained at 50%, then lowering back down to 0%, corresponding to the cpm count.

With the agent off, each node's CPU behaviour was a slow rise from 0% to 20%, sustained at 20%, then lowering back down to 0%, corresponding to the cpm count.

Hi, could you try out these changes: https://github.com/elastic/apm-agent-java/pull/446#issuecomment-457269647

This should reduce the CPU consumption considerably.

Thanks Felix! We'll get the team to run the tests and we'll get back to you as soon as we get the numbers.

Hi all, we ran an initial load test and positive results are very evident with the fix.

Before the fix:

After the fix:

So with the new wait strategy, CPU usage is basically the same for services with and without the agent, meaning there is essentially no CPU impact from the agent. Peak CPU occurred at the sustained 1,000 calls per minute per node.

Great job and many thanks to Felix and the Elastic Team! You guys rock!


That's great to hear!
