Java agent causes CPU spikes

digitalron · January 23, 2019, 3:32pm

After running more performance tests and subjecting our services with real-life traffic load profiles, we notice that our services are experiencing high CPU spikes with the Java agent enabled as opposed to our baseline with no agent. We tested with either Tomcat and Jetty and the same behaviour is observed.
As an example, our baseline 20% CPU jumps to 50% with agent and a baseline 2% jumps to around 20% with the agent.

Our profiling shows that the problem may be with the apm-reporter (and not instrumentation itself) and that the bulk of the CPU time is consumed by LMAX Disruptor on LiteBlockingWaitStrategy.waitFor()

I've attached a screenshot of a sample profile:

Elastic APM Stack 6.5.4
Java agent 1.30
We tried on both Jetty and Tomcat, same results

Is there something we can do to fix this or is this expected?

We came across this, but we are not sure if this is related to our issue and what our best course of action will be: https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/PhasedBackoffWaitStrategy.java

Eyal_Koren · January 23, 2019, 4:02pm

Hi and thanks for reporting!
The agent is indeed very lightweight on instrumentation, but if it sends all full traces to the APM server- this can be CPU intensive.
For high production loads, it is advisable to tune the sample rate to fit your needs.
I hope this helps,
Eyal.

digitalron · January 23, 2019, 4:17pm

Hi Eyal, we actually lowered samoling to 20% before getting the nunbers above. We're not too keen to go much lower. Higher than 20% and our CPU spikes are even worse. Thanks!

Eyal_Koren · January 23, 2019, 4:40pm

What loads are we talking about? Note that even if you set sample_rate to 0.01, we still collect timings for all transactions, we just don't collect and send the full data. 0.01 means you will have full traces for 1% of Transactions, which may be more than enough for high loads. If your loads are really high, I suggest you try it out.
We'll be very happy to get any numbers of which loads you are testing and what is roughly the overhead in each sample rate, and mostly- whether working with very low sample rate provides good results.
There are other tuning options that can be valid, depending on your application, you can go through the documentation of the available configuration options but I would exhaust the sample rate option before going there.
I hope this makes sense and will be glad to get more input,
Eyal.

felixbarny · January 23, 2019, 6:19pm

The problem does not seem to be the actual serialization and compressing but rather waiting for elements to become available in the disruptor. That’s interesting

We are actually already using the PhasedBackoffWaitStrategy. We’ll discuss and test other WaitStrategy options and get back to you.

I’d also be interested in the load in that scenario.

digitalron · January 23, 2019, 7:10pm

We'll continue running the load tests with even lower sampling rates to see the impact.

From what I remember, we have around 150 calls a minute on that service so we should be theoretically reporting on just 30 calls a minute average, 1 every two seconds which should be low. I'll get more accurate numbers when I get back to office tomorrow.

As a point of comparison, we did not see similar CPU loads when we were using a different agent (AppDynamics), and that was at 100% sampling.

Many thanks, Eyal and Felix.

felixbarny · January 24, 2019, 8:58am

Again, I don't think sampling is the problem here. Using a wait strategy which is easier on CPU should resolve that issue. I have created an issue for this: https://github.com/elastic/apm-agent-java/issues/443

felixbarny · January 24, 2019, 10:16am

Which exact Java version are you using and what's your OS?

digitalron · January 24, 2019, 11:11am

Hi Felix, Java and OS details are as follows:

java version "1.8.0_74"
Java(TM) SE Runtime Environment (build 1.8.0_74-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)

CentOS release 6.10 (Final)

Some additional numbers just to provide context: for the services that saw CPU use from 20% to 50%, traffic was hitting about 4,000 calls per minute sustained but spread over four nodes. There was an hour ramp up from 0 to 4000 cpm, sustained 4000 cpm for two hours, then slow ramp down to 0 for another hour. Traffic was evenly split, so average sustained is 1000 cpm/node.

With agent on, each node CPU behaviour was slow rise from 0% to 50%, sustained to 50% then lowering down to 0%, corresponding to the cpm count.

With agent off, each node CPU behaviour was slow rise from 0% to 20%, sustained to 20% then lowering down to 0%, corresponding to the cpm count.

felixbarny · January 24, 2019, 4:50pm

Hi, could you try out these changes: https://github.com/elastic/apm-agent-java/pull/446#issuecomment-457269647

This should reduce the CPU consumption considerably.

digitalron · January 24, 2019, 5:14pm

Thanks Felix! We'll get the team to run the tests and we'll get back to you as soon as we get the numbers.

digitalron · January 26, 2019, 4:26pm

Hi all, we ran an initial load test and positive results are very evident with the fix.

Before the fix:

After the fix:

So with the new wait strategy, CPU usage is basically the same for services with the agent and without the agent, meaning there is essentially no impact on the CPU by the agents. Peak CPU have sustained 1000 calls per minute per node.

Great job and many thanks to Felix and the Elastic Team! You guys rock!

felixbarny · January 30, 2019, 8:33am

That's great to hear!

system · February 20, 2019, 4:33am

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Java APM Agent causes CPU spikes APM java	2	816	February 5, 2021
Heavy CPU usage in APM Agents when APM-servers goes down APM	7	1070	February 5, 2019
Java Agent -100% CPU APM java	3	367	February 3, 2021
Performance degradation after update from 1.7.1 to 1.10 APM dotnet	2	449	August 9, 2021
High CPU usage with Python agent APM	4	678	June 18, 2018

Java agent causes CPU spikes

Related topics