Deadlock with Java APM agent

Kibana version: 7.8.0

Elasticsearch version: 7.8.0

APM Server version: 7.8.0

APM Agent language and version: Java, 1.17.0

Fresh install or upgraded from other version?:
Fresh

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

I am fairly new to Elastic APM and wanted to check it out. I am trying it in my developer environment, which runs Docker for Windows. I quickly set up Elasticsearch, Kibana, and APM Server on my local box (all version 7.8.0) using the Docker Compose setup from https://www.elastic.co/guide/en/apm/get-started/current/quick-start-overview.html. The containers came up fine. I'm testing against a Java application (JDK 8u252, linux_x64) that also runs in a Docker container. I also downloaded the APM agent from Maven ("elastic-apm-agent-1.17.0.jar") and have it mounted in a container folder. I started the application with

`-javaagent:/opt/appagent/elastic-apm/elastic-apm-agent-1.17.0.jar -Delastic.apm.config_file=/<app-classpath-dir>/elasticapm.properties`

and the properties file contains these settings:

    recording=true
    instrument=true
    service_name=ma
    hostname=ma
    environment=docker-dev
    transaction_max_spans=1000
    sanitize_field_names=<headers-to-exclude>
    ignore_urls=<urls-to-skip-like-healthy>
    server_urls=http://host.docker.internal:8200
    server_timeout=30000
    max_queue_size=2000
    api_request_time=30s
    api_request_size=2mb
    metrics_interval=15s
    application_packages=<app-java-package>
    stack_trace_limit=100

I frequently encounter a deadlock whenever the application server is started or restarted. Below is the relevant portion of a thread dump taken with `jstack -l`:

    "elastic-apm-remote-config-poller":
    	at sun.misc.Unsafe.park(Native Method)
    	- parking to wait for  <0x00000000fbc93ab0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202)
    	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:213)
    	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
    	at co.elastic.apm.agent.report.ApmServerClient.openUrlConnectionThreadSafely(ApmServerClient.java:161)
    	at co.elastic.apm.agent.report.ApmServerClient.startRequestToUrl(ApmServerClient.java:124)
    	at co.elastic.apm.agent.report.ApmServerClient.execute(ApmServerClient.java:241)
    	at co.elastic.apm.agent.configuration.ApmServerConfigurationSource.fetchConfig(ApmServerConfigurationSource.java:163)
    	at co.elastic.apm.agent.configuration.ApmServerConfigurationSource.pollConfig(ApmServerConfigurationSource.java:131)
    	at co.elastic.apm.agent.configuration.ApmServerConfigurationSource.access$000(ApmServerConfigurationSource.java:53)
    	at co.elastic.apm.agent.configuration.ApmServerConfigurationSource$1.run(ApmServerConfigurationSource.java:117)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    "elastic-apm-server-healthcheck":
    	at java.util.logging.LogManager.ensureLogManagerInitialized(LogManager.java:317)
    	- waiting to lock <0x00000000fad07898> (a org.apache.juli.ClassLoaderLogManager)
    	at java.util.logging.LogManager.getLogManager(LogManager.java:378)
    	at java.util.logging.Logger.getPlatformLogger(Logger.java:572)
    	at java.util.logging.LoggingProxyImpl.getLogger(LoggingProxyImpl.java:41)
    	at sun.util.logging.LoggingSupport.getLogger(LoggingSupport.java:100)
    	at sun.util.logging.PlatformLogger$JavaLoggerProxy.<init>(PlatformLogger.java:602)
    	at sun.util.logging.PlatformLogger$JavaLoggerProxy.<init>(PlatformLogger.java:597)
    	at sun.util.logging.PlatformLogger.<init>(PlatformLogger.java:239)
    	at sun.util.logging.PlatformLogger.getLogger(PlatformLogger.java:198)
    	- locked <0x00000000fb277628> (a java.lang.Class for sun.util.logging.PlatformLogger)
    	at sun.net.www.protocol.http.HttpURLConnection.<clinit>(HttpURLConnection.java:432)
    	at sun.net.www.protocol.http.Handler.openConnection(Handler.java:62)
    	at sun.net.www.protocol.http.Handler.openConnection(Handler.java:57)
    	at java.net.URL.openConnection(URL.java:1002)
    	at co.elastic.apm.agent.report.ApmServerClient.openUrlConnectionThreadSafely(ApmServerClient.java:163)
    	at co.elastic.apm.agent.report.ApmServerClient.startRequestToUrl(ApmServerClient.java:124)
    	at co.elastic.apm.agent.report.ApmServerClient.executeForAllUrls(ApmServerClient.java:266)
    	at co.elastic.apm.agent.report.ApmServerHealthChecker.call(ApmServerHealthChecker.java:69)
    	at co.elastic.apm.agent.report.ApmServerHealthChecker.call(ApmServerHealthChecker.java:47)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    "main":
    	at sun.util.logging.PlatformLogger.redirectPlatformLoggers(PlatformLogger.java:209)
    	- waiting to lock <0x00000000fb277628> (a java.lang.Class for sun.util.logging.PlatformLogger)
    	at java.util.logging.LogManager$3.run(LogManager.java:402)
    	at java.util.logging.LogManager$3.run(LogManager.java:396)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at java.util.logging.LogManager.readPrimordialConfiguration(LogManager.java:396)
    	- locked <0x00000000fad07898> (a org.apache.juli.ClassLoaderLogManager)
    	at java.util.logging.LogManager.access$800(LogManager.java:145)
    	at java.util.logging.LogManager$2.run(LogManager.java:345)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at java.util.logging.LogManager.ensureLogManagerInitialized(LogManager.java:338)
    	- locked <0x00000000fad07898> (a org.apache.juli.ClassLoaderLogManager)
    	at java.util.logging.LogManager.getLogManager(LogManager.java:378)
    	at java.util.logging.Logger.demandLogger(Logger.java:448)
    	at java.util.logging.Logger.getLogger(Logger.java:502)
    	at com.sun.jmx.remote.util.ClassLogger.<init>(ClassLogger.java:55)
    	at sun.management.jmxremote.ConnectorBootstrap.<clinit>(ConnectorBootstrap.java:846)
    	at sun.management.Agent.startAgent(Agent.java:262)
    	at sun.management.Agent.startAgent(Agent.java:452)

    Found 1 deadlock.

Thanks for raising this. This is a known issue with Tomcat and Java 8. We've already fixed it: https://github.com/elastic/apm-agent-java/pull/1262
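
For context on the mechanics: the dump shows a classic lock-ordering inversion. The `elastic-apm-server-healthcheck` thread takes the `PlatformLogger` class lock (while `HttpURLConnection` is being class-initialized) and then waits for the juli `LogManager` monitor, while `main` holds that monitor and waits for the `PlatformLogger` lock. Here is a minimal, self-contained sketch of the same pattern; the lock objects are stand-ins for illustration, not the agent's actual code:

    // Minimal reproduction of the lock-ordering inversion seen in the dump:
    // two threads take the same pair of locks in opposite order.
    public class LockOrderDeadlock {
        // Stand-in for the juli LogManager monitor (0x00000000fad07898).
        private static final Object LOG_MANAGER = new Object();
        // Stand-in for the PlatformLogger class lock (0x00000000fb277628).
        private static final Object PLATFORM_LOGGER = new Object();

        public static void main(String[] args) {
            Thread healthCheck = new Thread(() -> {
                synchronized (PLATFORM_LOGGER) {        // like HttpURLConnection.<clinit>
                    pause();
                    synchronized (LOG_MANAGER) { }      // like LogManager.getLogManager()
                }
            }, "healthcheck-like");
            Thread mainLike = new Thread(() -> {
                synchronized (LOG_MANAGER) {            // like readPrimordialConfiguration()
                    pause();
                    synchronized (PLATFORM_LOGGER) { }  // like redirectPlatformLoggers()
                }
            }, "main-like");
            healthCheck.start();
            mainLike.start();  // with the pauses, jstack will report this as a deadlock
        }

        private static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }
    }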

If you don't want to wait for the next release, just grab a snapshot from master: https://github.com/elastic/apm-agent-java#snapshots
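
Swapping it in only requires pointing the same startup flag at the snapshot jar; the file name below is a placeholder, not the actual snapshot artifact name:

    -javaagent:/opt/appagent/elastic-apm/elastic-apm-agent-<snapshot-version>.jar -Delastic.apm.config_file=/<app-classpath-dir>/elasticapm.properties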

Thanks for the response. I'll pull the fix, test it, and post my findings here.

By the way, I wanted to confirm a couple of things I noticed in my local tests.

  1. Continuation of the discussion above: I also noticed that leaving the defaults for `server_timeout` and `api_request_time` in place seems to work in my local environment. However, I still see the exception below. Is this the same issue as https://www.elastic.co/guide/en/apm/server/current/common-problems.html#io-timeout? (A config sketch follows after this list.)
    2020-07-16 19:16:09,335 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Sending payload to APM server failed
    java.io.IOException: Server returned HTTP response code: 400 for URL: http://host.docker.internal:8200/intake/v2/events
    	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
    	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
    	at co.elastic.apm.agent.report.AbstractIntakeApiHandler.endRequest(AbstractIntakeApiHandler.java:129)
    	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.endRequest(IntakeV2ReportingEventHandler.java:162)
    	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.handleEvent(IntakeV2ReportingEventHandler.java:85)
    	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:73)
    	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:44)
    	at co.elastic.apm.agent.shaded.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
    	at co.elastic.apm.agent.shaded.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
    	at java.lang.Thread.run(Thread.java:748)
    2020-07-16 19:16:09,336 [elastic-apm-server-reporter] WARN  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - {
      "accepted": 0,
      "errors": [
        {
          "message": "read tcp 172.20.0.4:8200-\u003e172.20.0.1:37080: i/o timeout"
        }
      ]
    }
  2. Off topic: I used another Java agent (JaCoCo, for code coverage), and with both agents attached my server wouldn't start. This is only for the dev environment and is not intended for production. Independently they each seem to work fine, so I'm not sure whether the ordering of `-javaagent` initialization matters.
    2020-07-17 15:01:37,489 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Scheduling next remote configuration reload in 30s
    Exception in thread "main" java.lang.reflect.InvocationTargetException
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386)
    	at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401)
    Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented.
    	at org.jacoco.agent.rt.internal_b0d6a23.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:139)
    	at org.jacoco.agent.rt.internal_b0d6a23.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:100)
    	at org.jacoco.agent.rt.internal_b0d6a23.PreMain.createRuntime(PreMain.java:55)
    	at org.jacoco.agent.rt.internal_b0d6a23.PreMain.premain(PreMain.java:47)
    	... 6 more
    Caused by: java.lang.NoSuchFieldException: $jacocoAccess
    	at java.lang.Class.getField(Class.java:1703)
    	at org.jacoco.agent.rt.internal_b0d6a23.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:137)
    	... 9 more
    FATAL ERROR in native method: processing of -javaagent failed
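
To make the sketch promised in item 1 concrete: the common-problems page suggests the i/o timeout appears when the agent holds its intake request open longer than the server is willing to read, so the server cuts the connection and replies with a 400. The defaults that worked for me keep that window short. The values below are my reading of the agent's documented defaults, included only to show the shape of the fix, not verified against 1.17.0:

    # Keep the agent's intake request comfortably shorter than the
    # APM Server's read timeout (assumed agent defaults, for illustration).
    api_request_time=10s
    server_timeout=5s

That is the opposite direction from my overrides above (`api_request_time=30s`, `server_timeout=30000`), which presumably kept each HTTP request open past the server's read timeout.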

Please create a separate issue for that and make sure to tag it with server and java so that the APM Server team can take a look.

You could try loading the Elastic agent as the second one.
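
Roughly like this, with both paths as placeholders based on your setup (line-wrapped here for readability):

    -javaagent:/opt/appagent/jacoco/jacocoagent.jar
    -javaagent:/opt/appagent/elastic-apm/elastic-apm-agent-1.17.0.jar -Delastic.apm.config_file=/<app-classpath-dir>/elasticapm.properties

The JVM runs `premain` methods in the order the flags appear, so listing JaCoCo first presumably lets it instrument `java/util/UUID` before the Elastic agent causes that class to be loaded.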

Changing the order by placing the Elastic agent at the end worked.

I tried the fix and restarted the server about 8-10 times; so far I have not seen the deadlock.
