Do you have any more details, like the frequency of reconnections, where exactly the CPU usage stems from, and the agent logs? The Java agent uses an increasing backoff, so after a while it only tries to reconnect every 36 seconds.
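For illustration, here is a minimal sketch of how such an increasing backoff can work. The quadratic growth and the 36-second cap are assumptions for the example, not confirmed details of the agent's internal implementation:

```java
// Sketch of an increasing reconnect backoff, capped at 36 seconds.
// Assumption: the wait grows quadratically with the number of failed attempts.
public class ReconnectBackoff {
    private static final long MAX_BACKOFF_SECONDS = 36;
    private long failedAttempts = 0;

    /** Returns how many seconds to wait before the next reconnection attempt. */
    public long nextBackoffSeconds() {
        long backoff = Math.min(failedAttempts * failedAttempts, MAX_BACKOFF_SECONDS);
        failedAttempts++;
        return backoff;
    }

    /** Reset the backoff once a connection succeeds. */
    public void onConnectionSuccess() {
        failedAttempts = 0;
    }
}
```

With this scheme the waits would be 0, 1, 4, 9, 16, 25, 36, 36, ... seconds, which is why after a while the agent only retries roughly every 36 seconds.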
P.S.: I'm seeing a different error type now. Yesterday I saw:
[apm-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: connect timed out
And today (CPU is OK):
ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type TRANSACTION with this error: connect timed out
Are you certain that the high CPU usage was caused by the agent? Or could the cause be somewhere else, given that the APM Server was not the only component that had an outage?
I just ran some benchmarks without an APM Server and I couldn't reproduce the high CPU usage.
I can't reproduce the same behaviour; the CPU is fine now. Yesterday the APM Server was crashing because Elasticsearch was crashing with the "No shards available or All shards failed" error.
I'm trying to reproduce the same behaviour. When the APM Server recovered, the CPU load went down instantly.
Actually, I can't reproduce it. Maybe it was some side effect. If I see the same CPU load again, I will report it.
Could the high CPU usage be caused by the Elasticsearch or APM Server pods, rather than your application/the agent?
The pods with heavy CPU load were our Java microservices (the ones with the agent). It may also be worth mentioning that the AWS CPU credits were at 0 (t2 unlimited was enabled).