Heavy CPU usage in APM agents when APM Server goes down

Hi!

Yesterday I saw in one of my testing environments that the agents were consuming a lot of CPU because they were trying to reconnect to the APM Server (which was down).

Is it possible to configure the reconnection options to avoid that heavy load?

Thanks!

PS: Java agent v1.3.0

Hi and thanks a lot for the report.

Do you have any more details like the frequency of reconnections, where the CPU usage exactly stems from and the agent logs? The Java agent uses an increasing backoff so that it only tries to reconnect every 36 seconds after a while.
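For context, the increasing backoff works roughly like this (a minimal sketch, assuming a quadratic delay capped at 36 seconds, which matches the interval mentioned above; the class and method names are illustrative, not the agent's actual code):

```java
// Sketch of an increasing reconnection backoff: the wait grows
// quadratically with the number of failed attempts and is capped,
// so after a while the agent only retries every 36 seconds.
public class BackoffSketch {
    static final long MAX_BACKOFF_SECONDS = 36;

    // Returns how long to wait before the next reconnection attempt.
    static long backoffSeconds(int failedAttempts) {
        long delay = (long) failedAttempts * failedAttempts; // 0, 1, 4, 9, 16, 25, 36, ...
        return Math.min(delay, MAX_BACKOFF_SECONDS);         // capped at 36 s
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 8; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                    + backoffSeconds(attempt) + "s");
        }
    }
}
```

With a schedule like this, a down server should cost the agent only one cheap connection attempt every 36 seconds in the steady state, which is why sustained high CPU would be surprising.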

Thanks,
Felix

[image: CPU usage graph]

This was the CPU graph. I'm looking for the logs...

I'm trying to reproduce a similar scenario.

PS: I'm now seeing a different error type. Yesterday I saw:

[apm-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: connect timed out

And today (CPU is OK):

ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type TRANSACTION with this error: connect timed out

The reconnection interval is 36 seconds, like you said.

Are you certain that the high CPU usage was caused by the agent? Or could the cause be somewhere else, given that the APM Server wasn't the only thing that had an outage?

I just ran some benchmarks without an APM Server and I couldn't reproduce the high CPU usage.

I can't reproduce the same behaviour; the CPU is fine now. Yesterday the APM Server was crashing because Elasticsearch was failing with the "No shards available or All shards failed" error.

I'm trying to trigger the same behaviour. When the APM Server recovered, the CPU load went down instantly.

So far I can't reproduce it. Maybe it was some side effect. If I see the same CPU load again, I will report back.

Thanks for the reply, and sorry for the trouble!

No worries! Again, thanks for the feedback.

Could the high CPU usage be caused by the Elasticsearch or APM Server pods, rather than your application/the agent?


The pods with heavy CPU load were our Java microservices (the ones running the agent). It may also be worth mentioning that the AWS CPU credits were at 0 (t2 unlimited was enabled).

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.