Java agent makes Micronaut hang

Versions:
Micronaut 3.7 running in a Docker container.
JDK 17 Temurin
Java Agent 1.34.0

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
During initialization of the application (loading data) and also under heavy load during normal operation, the Java agent seems to lock up Micronaut, making it respond slowly to incoming requests.
The AWS ALB health check has a normal timeout of 5 seconds, but with the Java agent enabled I have to extend the timeout to 30 seconds; otherwise AWS ECS Fargate restarts the container because of the failing health check.

This also happens in applications that load and process lots of data during startup: the agent makes Micronaut completely unresponsive during the initialization phase, and we need to set a health check grace period long enough for the application to get up and running before the checks start.

Without the Java agent, neither the grace period nor the extended timeout is needed: the application starts as expected and also handles peak traffic without failing to respond to health checks.

This will happen even if configuring the agent with

  • elastic.apm.recording set to false
  • elastic.apm.enabled set to false
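For illustration, here is roughly how those options would be passed as JVM system properties alongside the agent; this is a sketch, and the agent jar path and application jar name are placeholders, not taken from the thread:

```shell
# Hypothetical startup command; adjust the -javaagent path to your agent jar.
java \
  -javaagent:/opt/elastic-apm-agent-1.34.0.jar \
  -Delastic.apm.enabled=false \
  -Delastic.apm.recording=false \
  -jar app.jar
```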

Removing the agent from the JVM startup parameters removes the problem completely.
Extending the health check timeout to 20-30 seconds, plus adding a grace period to cover startup initialization, lets the app start with the agent, but those values seem extremely exaggerated and should not be needed.

Hi Daniel, welcome to our forum :wave:

Take a look at the caveats section of the AWS Lambda tracing documentation; it explains why startup can be affected by instrumentation and proposes ways to overcome that.

I hope it helps.
Please let us know if and how you sorted this out, it may be useful for other users as well.

Hi @Eyal_Koren!
The only workaround that has had any visible effect on stability so far is adjusting the health check parameters in AWS ECS:

  • Wait 30 seconds before health checks start.
  • Increase the health check response timeout to more than 10 seconds.
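For context, those two knobs correspond to the startPeriod and timeout fields of the container healthCheck block in an ECS task definition. A sketch of that block follows; the check command and port are assumptions (Micronaut's default is 8080), not taken from the thread:

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -fs http://localhost:8080/health || exit 1"],
    "interval": 30,
    "timeout": 10,
    "retries": 3,
    "startPeriod": 30
  }
}
```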

It's a bit strange, because Elastic APM reports that the longest health check latency is 2 seconds, and most checks are much faster, around 5-10 ms.

I'm now testing two things: disabling most instrumentations and running again with the default health check settings, and using Micronaut's /health/liveness endpoint for the health check instead of /health, to see whether that more general endpoint triggers some instrumentation that slows the app down for some weird reason while calculating health.

I am not sure what you are referring to; please add a link. Could it be the health check that the agent does against the APM Server?

I'm now testing to disable most instrumentations

Try to follow the instructions I referred you to. You can use the agent logs to compile the minimal list of instrumentations to enable.
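If the minimal list turns out to be only the annotation support, the allow-list could be expressed in elasticapm.properties roughly as below. This is an assumption for illustration; the `enable_instrumentations` option exists in recent 1.x agents, but check the configuration reference for your exact version:

```properties
# Enable only the named instrumentation group(s); everything else stays off.
enable_instrumentations=annotations
```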

and running again with the default health check settings, and using Micronaut's /health/liveness endpoint for the health check instead of /health, to see whether that more general endpoint triggers some instrumentation that slows the app down while calculating health.

It shouldn't trigger instrumentation after the first invocation. If just random endpoint invocations experience unusual latencies, this needs to be analyzed further. If you have a test env where you can reproduce it, try setting log_level=debug and see if the log provides hints.
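For reference, the debug logging mentioned above can be set in elasticapm.properties (or as -Delastic.apm.log_level=debug on the JVM command line); the log file path below is just an example, not from the thread:

```properties
log_level=debug
# Optional: write agent logs to a file instead of the app's stdout.
log_file=/tmp/apm-agent.log
```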

I am not sure what you are referring to; please add a link. Could it be the health check that the agent does against the APM Server?

Sorry, I was unclear. I was talking about the /health endpoint of my Micronaut-based web application, which is traced by the Elastic APM agent. When the health check does not respond fast enough, the Docker container is restarted by the container orchestrator (AWS ECS) because the liveness check fails. We have now raised the timeout from 5 seconds to 20-30 seconds to avoid restarts.

Try to follow the instructions I referred you to. You can use the agent logs to compile the minimal list of instrumentations to enable.

I have done that and enabled only "annotations-capture-transaction". I will see tomorrow whether it has any effect (the problem mostly affects the application during a nightly transfer).

It shouldn't trigger instrumentation after the first invocation. If just random endpoint invocations experience unusual latencies, this needs to be analyzed further. If you have a test env where you can reproduce it, try setting log_level=debug and see if the log provides hints.

I will enable debug logging in our dev environment to see if it logs anything interesting.
The slow startup is explainable by the initial instrumentation work, but it seems that under heavy load during normal operation the agent also makes the web application slow or unresponsive.

Ahh, so your instrumentation relies solely on our annotation API? Nothing else gets traced other than what you annotated with @CaptureTransaction?
In this case, I would look into two additional things:

  1. Make sure you configure application_packages with only the minimal root package(s) that cover everything you want to instrument.
  2. Be cautious about what you annotate with our public API annotations. If, for example, you annotate a method that is executed a very high number of times within a transaction, this may be related.
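As an illustration of both points (this example is not from the thread; it assumes the co.elastic.apm:apm-agent-api dependency is on the classpath and that the package is covered by application_packages; the class and package names are made up):

```java
package com.example.transfer; // assumed to be listed under application_packages

import co.elastic.apm.api.CaptureTransaction;

public class NightlyTransferJob {

    // One transaction per job run: a good fit for @CaptureTransaction.
    @CaptureTransaction("nightly-transfer")
    public void run(Iterable<String> records) {
        for (String record : records) {
            process(record);
        }
    }

    // Caution: if process() runs thousands of times per transaction,
    // annotating it too (e.g. with @CaptureSpan) would create one span
    // per call and could itself cause the slowdown described above.
    private void process(String record) {
        // ... business logic ...
    }
}
```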

Let us know if you get a debug log and you can't make sense out of it.

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.