After running more performance tests and subjecting our services with real-life traffic load profiles, we notice that our services are experiencing high CPU spikes with the Java agent enabled as opposed to our baseline with no agent. We tested with either Tomcat and Jetty and the same behaviour is observed.
As an example, our baseline 20% CPU jumps to 50% with agent and a baseline 2% jumps to around 20% with the agent.
Our profiling shows that the problem may be with the apm-reporter (and not instrumentation itself) and that the bulk of the CPU time is consumed by LMAX Disruptor on LiteBlockingWaitStrategy.waitFor()
Hi and thanks for reporting!
The agent is indeed very lightweight on instrumentation, but if it sends all full traces to the APM server- this can be CPU intensive.
For high production loads, it is advisable to tune the sample rate to fit your needs.
I hope this helps,
Eyal.
Hi Eyal, we actually lowered samoling to 20% before getting the nunbers above. We're not too keen to go much lower. Higher than 20% and our CPU spikes are even worse. Thanks!
What loads are we talking about? Note that even if you set sample_rate to 0.01, we still collect timings for all transactions, we just don't collect and send the full data. 0.01 means you will have full traces for 1% of Transactions, which may be more than enough for high loads. If your loads are really high, I suggest you try it out.
We'll be very happy to get any numbers of which loads you are testing and what is roughly the overhead in each sample rate, and mostly- whether working with very low sample rate provides good results.
There are other tuning options that can be valid, depending on your application, you can go through the documentation of the available configuration options but I would exhaust the sample rate option before going there.
I hope this makes sense and will be glad to get more input,
Eyal.
The problem does not seem to be the actual serialization and compressing but rather waiting for elements to become available in the disruptor. That’s interesting
We are actually already using the PhasedBackoffWaitStrategy. We’ll discuss and test other WaitStrategy options and get back to you.
I’d also be interested in the load in that scenario.
We'll continue running the load tests with even lower sampling rates to see the impact.
From what I remember, we have around 150 calls a minute on that service so we should be theoretically reporting on just 30 calls a minute average, 1 every two seconds which should be low. I'll get more accurate numbers when I get back to office tomorrow.
As a point of comparison, we did not see similar CPU loads when we were using a different agent (AppDynamics), and that was at 100% sampling.
Again, I don't think sampling is the problem here. Using a wait strategy which is easier on CPU should resolve that issue. I have created an issue for this: https://github.com/elastic/apm-agent-java/issues/443
java version "1.8.0_74"
Java(TM) SE Runtime Environment (build 1.8.0_74-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)
CentOS release 6.10 (Final)
Some additional numbers just to provide context: for the services that saw CPU use from 20% to 50%, traffic was hitting about 4,000 calls per minute sustained but spread over four nodes. There was an hour ramp up from 0 to 4000 cpm, sustained 4000 cpm for two hours, then slow ramp down to 0 for another hour. Traffic was evenly split, so average sustained is 1000 cpm/node.
With agent on, each node CPU behaviour was slow rise from 0% to 50%, sustained to 50% then lowering down to 0%, corresponding to the cpm count.
With agent off, each node CPU behaviour was slow rise from 0% to 20%, sustained to 20% then lowering down to 0%, corresponding to the cpm count.
So with the new wait strategy, CPU usage is basically the same for services with the agent and without the agent, meaning there is essentially no impact on the CPU by the agents. Peak CPU have sustained 1000 calls per minute per node.
Great job and many thanks to Felix and the Elastic Team! You guys rock!
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.