Java service not recovering from APM circuit breaker after CPU stress subsides in Kubernetes

HemdeepSaini · January 30, 2024, 12:00pm

Description:

I'm using Elastic APM to monitor a Java application deployed on a Kubernetes pod. When the service initially experiences CPU stress, the APM TRACER correctly switches to the PAUSED state due to the circuit breaker. However, even after the CPU load falls below the configured threshold, the TRACER remains in the PAUSED state, and traces and metrics are not captured.

Expected Behavior:

The APM tracer should automatically switch back to the RUNNING state once the CPU load returns to normal levels. This would allow APM to resume capturing traces for the service.

Current Behavior:

The CPU load initially spikes and triggers the circuit breaker, pausing the tracer.
The CPU load subsequently returns to normal levels, but the tracer remains paused.
Traces are not captured while the tracer is paused.

Logs:

2024-01-30 14:37:00,511 [elastic-apm-circuit-breaker] INFO co.elastic.apm.agent.impl.circuitbreaker.CircuitBreaker - Stress detected by co.elastic.apm.agent.impl.circuitbreaker.SystemCpuStressMonitor: Latest system CPU load value measured is 1.0. This is the 20th consecutive measurement that crossed the configured stress threshold - 0.95, which indicates this host is under CPU stress.
2024-01-30 14:37:00,518 [elastic-apm-circuit-breaker] INFO co.elastic.apm.agent.impl.ElasticApmTracer - Tracer switched to PAUSED state

Environment:

Kibana version: 8.8.1
Elasticsearch version: 8.8.0
APM Server version: 8.8.3
APM Agent language and version: Java (version not specified)

Additional Information:

As seen in the attached picture, the CPU stress subsides.

stress_monitor_gc_relief_threshold = 0.9
stress_monitor_gc_stress_threshold = 0.95

Why is the APM tracer not automatically switching back to the RUNNING state in Kubernetes after the CPU stress subsides? Are there any additional configuration or troubleshooting steps I can take to address this issue?

Sylvain_Juge · January 30, 2024, 1:01pm

Hi @HemdeepSaini ,

The circuit breaker is expected to return to a normal state after the load spike has passed.

To investigate, increase the agent log level with log_level=debug, or even log_level=trace, to see what is happening.

The relevant log messages come from the CircuitBreaker class.

Also, the agent polls the load at intervals, so there is a delay before it detects that the load has dropped.
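
To make the polling behaviour more concrete, here is a minimal sketch of a polling circuit breaker with separate stress and relief thresholds. It is not the agent's actual implementation: the class name, the threshold values, and the number of required consecutive measurements are all hypothetical and chosen only for illustration. The point it tries to show is that a load which falls below the stress threshold but stays above the relief threshold would not be enough to leave the PAUSED state.

```java
public class CircuitBreakerSketch {
    public enum State { RUNNING, PAUSED }

    // Hypothetical values for illustration; not read from any agent configuration.
    static final double STRESS_THRESHOLD = 0.95;
    static final double RELIEF_THRESHOLD = 0.80;
    static final int REQUIRED_CONSECUTIVE = 3;

    private State state = State.RUNNING;
    private int consecutive = 0;

    // Called once per polling interval with the latest CPU load (0.0 to 1.0).
    public State onMeasurement(double cpuLoad) {
        // While RUNNING we count stress readings; while PAUSED we count relief readings.
        boolean counts = (state == State.RUNNING)
                ? cpuLoad >= STRESS_THRESHOLD
                : cpuLoad <= RELIEF_THRESHOLD;
        // Any reading that does not count resets the streak.
        consecutive = counts ? consecutive + 1 : 0;
        if (consecutive >= REQUIRED_CONSECUTIVE) {
            state = (state == State.RUNNING) ? State.PAUSED : State.RUNNING;
            consecutive = 0;
        }
        return state;
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch();
        // Spike, then a load between the two thresholds, then a real drop.
        double[] samples = {1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.7, 0.7, 0.7};
        for (double sample : samples) {
            System.out.println(sample + " -> " + breaker.onMeasurement(sample));
        }
    }
}
```

In this sketch the state only changes after several consecutive polls agree, so recovery is delayed by at least the polling interval multiplied by the required count. With debug logging enabled, comparing the measured values against the configured thresholds should show whether something similar is happening in the pod.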

system · February 27, 2024, 1:01pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
