Java service not recovering from APM circuit breaker after CPU stress subsides in Kubernetes

Description:

I'm using Elastic APM to monitor a Java application deployed on a Kubernetes pod. When the service initially experiences CPU stress, the APM TRACER correctly switches to the PAUSED state due to the circuit breaker. However, even after the CPU load falls below the configured threshold, the TRACER remains in the PAUSED state, and traces and metrics are not captured.

Expected Behavior:

The APM tracer should automatically switch back to the RUNNING state once the CPU load returns to normal levels. This would allow APM to resume capturing traces for the service.

Current Behavior:

  • The CPU load initially spikes and triggers the circuit breaker, pausing the tracer.
  • The CPU load subsequently returns to normal levels, but the tracer remains paused.
  • Traces are not captured while the tracer is paused.

Logs:

 2024-01-30 14:37:00,511 [elastic-apm-circuit-breaker] INFO co.elastic.apm.agent.impl.circuitbreaker.CircuitBreaker - Stress detected by co.elastic.apm.agent.impl.circuitbreaker.SystemCpuStressMonitor: Latest system CPU load value measured is 1.0. This is the 20th consecutive measurement that crossed the configured stress threshold - 0.95, which indicates this host is under CPU stress.
 2024-01-30 14:37:00,518 [elastic-apm-circuit-breaker] INFO co.elastic.apm.agent.impl.ElasticApmTracer - Tracer switched to PAUSED state

Environment:

  • Kibana version: 8.8.1
  • Elasticsearch version: 8.8.0
  • APM Server version: 8.8.3
  • APM Agent language and version: (Please specify language and version)

Additional Information:


As seen in the picture, the CPU stress subsides. The relevant settings are:

  • stress_monitor_gc_relief_threshold = 0.9
  • stress_monitor_gc_stress_threshold = 0.95

Why is the APM tracer not automatically switching back to the RUNNING state in Kubernetes after the CPU stress subsides? Are there any additional configuration options or troubleshooting steps I can use to address this issue?

Hi @HemdeepSaini ,

The circuit breaker is expected to return to a normal state after the load spike has passed.

To investigate this, increase the agent log level with log_level=debug, or even log_level=trace, to see what is happening.

The relevant log messages will come from the CircuitBreaker class.
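To cross-check the load values reported in those CircuitBreaker messages, it can help to print what the JVM itself reports as system CPU load from inside the pod. The snippet below is only a rough sketch: CpuLoadProbe is a hypothetical helper, not part of the agent, and it assumes the agent's SystemCpuStressMonitor reads a similar JVM-level value, which the log wording suggests but this thread does not confirm. Depending on the JDK version and its container support, the value may reflect the whole node rather than the pod's CPU limit.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical diagnostic helper; not part of the Elastic APM agent. */
class CpuLoadProbe {

    /**
     * Returns the system CPU load as reported by the JVM (0.0 to 1.0),
     * or a negative value when the JVM cannot provide it.
     */
    @SuppressWarnings("deprecation") // getSystemCpuLoad() is deprecated since JDK 14 but still present
    static double systemCpuLoad() {
        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getSystemCpuLoad();
        }
        return -1.0; // this JVM does not expose the com.sun.management bean
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample a few times; the first reading is often 0.0 or negative.
        for (int i = 0; i < 5; i++) {
            System.out.printf("system CPU load: %.2f%n", systemCpuLoad());
            Thread.sleep(1000);
        }
    }
}
```

If the printed values stay close to 1.0 while the pod-level dashboards show the load dropping, the JVM and the dashboard are probably measuring different things, which would explain a tracer that never sees relief.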

Also, the agent polls for load measurements, so there is a delay before it can detect that the load has dropped.
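As a rough illustration of why polling with separate stress and relief thresholds can keep a tracer paused longer than expected, here is a minimal sketch. It is not the agent's actual CircuitBreaker code: the class name, the onPoll method, and the rule of requiring a fixed number of consecutive measurements in both directions are assumptions made for the example. The thresholds 0.95 and 0.90 mirror the values quoted in the question.

```java
/** Hypothetical illustration of a polled breaker with hysteresis; not the agent's code. */
class HysteresisBreakerSketch {

    enum State { RUNNING, PAUSED }

    private final double stressThreshold;   // at or above this, a poll counts toward pausing
    private final double reliefThreshold;   // at or below this, a poll counts toward resuming
    private final int requiredConsecutive;  // assumed: same count needed in both directions
    private State state = State.RUNNING;
    private int consecutive = 0;

    HysteresisBreakerSketch(double stressThreshold, double reliefThreshold, int requiredConsecutive) {
        this.stressThreshold = stressThreshold;
        this.reliefThreshold = reliefThreshold;
        this.requiredConsecutive = requiredConsecutive;
    }

    /** Feeds one polled load measurement and returns the state after processing it. */
    State onPoll(double load) {
        boolean crossed = (state == State.RUNNING)
                ? load >= stressThreshold
                : load <= reliefThreshold;
        // Any measurement that does not cross the relevant threshold resets the streak.
        consecutive = crossed ? consecutive + 1 : 0;
        if (consecutive >= requiredConsecutive) {
            state = (state == State.RUNNING) ? State.PAUSED : State.RUNNING;
            consecutive = 0;
        }
        return state;
    }

    public static void main(String[] args) {
        HysteresisBreakerSketch breaker = new HysteresisBreakerSketch(0.95, 0.90, 3);
        double[] polls = {1.0, 1.0, 1.0, 0.93, 0.93, 0.85, 0.85, 0.85};
        for (double load : polls) {
            System.out.println(load + " -> " + breaker.onPoll(load));
        }
    }
}
```

In this sketch a load of 0.93 is below the stress threshold but still above the relief threshold, so the breaker stays PAUSED until several consecutive polls fall to 0.90 or lower. If the real agent behaves similarly, a load that merely dips under the stress threshold would not be enough to resume tracing, and the debug logs should show which threshold each measurement is compared against.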
