Instrumentation of Jenkins fails: a thread dies unexpectedly due to an uncaught exception

Hi,

I tried to instrument a Jenkins instance, but the Elastic APM agent's timer thread dies.

Kibana version: 7.2

Elasticsearch version: 7.2

APM Server version: 7.2

APM Agent language and version: Java, 1.7

Steps to reproduce:

  1. Install Jenkins
  2. Add the following to JENKINS_JAVA_OPTIONS:
    -javaagent:/opt/elastic-apm-agent-1.7.0.jar -Delastic.apm.disable_instrumentation='' -Delastic.apm.application_packages=hudson,jenkins,org.eclipse -Delastic.apm.trace_methods=hudson.*,jenkins.*,org.eclipse.* -Delastic.apm.service_name=jenkins -Delastic.apm.server_url=http://apm-server:8200
  3. Restart Jenkins

Provide logs and/or server output (if relevant):
2019-07-05 12:25:55.744+0000 [id=13] SEVERE h.i.i.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler#uncaughtException: A thread (apm-request-timeout-timer/13) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
java.lang.IllegalStateException: Ring buffer has no available slots
at co.elastic.apm.agent.report.ApmServerReporter.flush(ApmServerReporter.java:173)
at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler$FlushOnTimeoutTimerTask.run(IntakeV2ReportingEventHandler.java:412)
at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
at java.base/java.util.TimerThread.run(Timer.java:506)
2019-07-05 14:26:21.329 [apm-reporter] INFO co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Backing off for 0 seconds (+/-10%)
2019-07-05 14:26:21.329 [apm-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Error writing request body to server, response code is -1
2019-07-05 14:26:21.330 [apm-reporter] WARN co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - null
2019-07-05 14:26:21.332 [apm-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type SPAN with this error: Timer already cancelled.
2019-07-05 14:26:21.333 [apm-reporter] INFO co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Backing off for 1 seconds (+/-10%)

Any ideas?

Best regards,
Robert

Hi and thanks for reporting!

Looks like you are right: when a task throws an exception in the Timer's main loop, the timer thread dies and the Timer then behaves as if it had been cancelled. We will look into that.
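
To illustrate with plain `java.util.Timer`, here is a minimal sketch (not the agent's code; the class `TimerDeathDemo` and its methods are made up for the demo). An unchecked exception escaping a `TimerTask` terminates the timer thread, and the next `schedule` call fails with the same "Timer already cancelled." message that appears in the log above. Wrapping the task body in a try/catch is one common way to keep the timer alive; whether the agent ends up doing that is an assumption here.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class TimerDeathDemo {

    /** Runs the body as-is, so any RuntimeException escapes into the timer thread. */
    static TimerTask unguarded(Runnable body) {
        return new TimerTask() {
            @Override
            public void run() {
                body.run();
            }
        };
    }

    /** Catches RuntimeException so a failing body cannot kill the timer thread. */
    static TimerTask guarded(Runnable body) {
        return new TimerTask() {
            @Override
            public void run() {
                try {
                    body.run();
                } catch (RuntimeException e) {
                    System.err.println("timer task failed: " + e.getMessage());
                }
            }
        };
    }

    /**
     * Runs one failing task on a fresh Timer, then tries to schedule a second task.
     * Returns the message of the IllegalStateException thrown by the second
     * schedule call, or null if scheduling succeeded.
     */
    static String scheduleAfterFailure(boolean guard) throws InterruptedException {
        Timer timer = new Timer("demo-timer", true); // daemon timer thread
        AtomicReference<Thread> timerThread = new AtomicReference<>();
        CountDownLatch started = new CountDownLatch(1);

        Runnable failing = () -> {
            timerThread.set(Thread.currentThread());
            started.countDown();
            throw new IllegalStateException("Ring buffer has no available slots");
        };

        timer.schedule(guard ? guarded(failing) : unguarded(failing), 0);
        started.await();
        if (!guard) {
            // The unguarded task kills the timer thread; wait until it is gone.
            timerThread.get().join();
        }

        try {
            timer.schedule(unguarded(() -> { }), 0);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unguarded: " + scheduleAfterFailure(false));
        System.out.println("guarded:   " + scheduleAfterFailure(true));
    }
}
```

In the unguarded run the JVM's default uncaught-exception handler also prints the stack trace to stderr, much like the Jenkins handler did in the report.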

Does this reproduce every time?
Was there a working connection to the APM server before that? You can see that at the top of the agent log.

Hi,

This happened once shortly after a restart of Jenkins. Today it ran fine for several hours, but it crashed just now.

Shortly before the crash, lines like the following appeared in the log:

2019-07-10 15:10:30.924 [apm-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Server returned HTTP response code: 503 for URL: http://192.168.122.150:8200/intake/v2/events, response code is 503
2019-07-10 15:10:30.924 [apm-reporter] WARN co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - {
"accepted": 890,
"errors": [
{
"message": "queue is full"
}
]
}
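
For context on the "Backing off for N seconds (+/-10%)" lines in the first log: as far as I understand the agent's behaviour, after each failed request (such as the 503 above) the reporter waits min(errorCount, 6) squared seconds, with roughly 10% random jitter, before retrying. That would give 0, 1, 4, 9, ... up to 36 seconds, which matches the 0 and 1 second values logged earlier. The sketch below only illustrates that formula; the class and method names are hypothetical and not taken from the agent.

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {

    /** Base delay in seconds before jitter: min(errorCount, 6) squared. */
    static long baseDelaySeconds(long errorCount) {
        long capped = Math.min(errorCount, 6);
        return capped * capped;
    }

    /** Base delay with +/-10% random jitter applied, in milliseconds. */
    static long jitteredDelayMillis(long errorCount) {
        long baseMillis = baseDelaySeconds(errorCount) * 1000;
        long jitter = baseMillis / 10;
        if (jitter == 0) {
            return baseMillis; // nothing to randomize for a zero delay
        }
        // nextLong's upper bound is exclusive, hence jitter + 1
        return baseMillis + ThreadLocalRandom.current().nextLong(-jitter, jitter + 1);
    }

    public static void main(String[] args) {
        for (long errors = 0; errors <= 7; errors++) {
            System.out.println(errors + " consecutive errors -> base delay "
                    + baseDelaySeconds(errors) + " s");
        }
    }
}
```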

Oh no!
Please send the agent log from this error message through to the end, along with any server logs that may contain information about the crash.

Based on what I saw earlier, I opened a pull request. Please try the snapshot build it produced.

I'll have the new version in use and report back later.

Thanks. Note, though, that it is not an official release; it is a snapshot build.

It seems that has fixed the issue.

Great! Thanks for the update.
Please let us know if something changes.