NPE using APM Agent 1.34.1

The same error that was reported in NPE using APM Agent 1.33.0 still happens in Java agent 1.34.1.

java.lang.NullPointerException
	at co.elastic.apm.agent.impl.transaction.Span.afterEnd(Span.java:289)
	at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:601)
	at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:580)
	at co.elastic.apm.agent.httpclient.ApacheHttpClientInstrumentation$ApacheHttpClientAdvice.onAfterExecute(ApacheHttpClientInstrumentation.java:99)

Thanks for reporting this @barath ,

What is the frequency of this issue ? Are you able to reproduce it in your environment ?

I have just opened PR #2859 that should fix this issue.
Could you try with this snapshot and tell us if that makes a difference ?

Thanks for quick response. It happens about three times in an hour (we have about 2300 request per minute).

I'll test your fix in our system soon and come back to you when I have results.

It seems your fix solved the issue. I deployed the snapshot yesterday afternoon and the error haven't happened since but I've got an another one:

java.lang.NullPointerException
    at co.elastic.apm.agent.impl.transaction.Span.tryToCompressRegular(Span.java:369)
    at co.elastic.apm.agent.impl.transaction.Span.tryToCompress(Span.java:317)
    at co.elastic.apm.agent.impl.transaction.Span.afterEnd(Span.java:286)
    at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:601)
    at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:580)
    at co.elastic.apm.agent.httpclient.ApacheHttpClientInstrumentation$ApacheHttpClientAdvice.onAfterExecute(ApacheHttpClientInstrumentation.java:99)

Although it happened only once.

Thanks for checking the snapshot and reporting this other NPE, I'll take a look.
Even if it is not triggered that often, it still means something is off here.

@barath I attempted to address the recent reported concurrency issue through PR #2865.
Please try this snapshot to see whether the is resolved.

I've deployed your snapshot but unfortunately the exception is still thrown:

java.lang.NullPointerException
	at co.elastic.apm.agent.impl.transaction.Span.afterEnd(Span.java:289)
	at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:600)
	at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:579)
	at co.elastic.apm.agent.httpclient.ApacheHttpClientInstrumentation$ApacheHttpClientAdvice.onAfterExecute(ApacheHttpClientInstrumentation.java:99)

OK, thanks for reporting.
I actually recognized there was still a race condition in the former implementation, but didn't get to that so far.
I now changed this part of the code to be completely thread-safe, so please try out this build. If there is still an issue, it means I didn't identify the exact issue and will have to look elsewhere.
Looking forward to get your feedback.

@barath did you get the chance to test the latest snapshot?

@Eyal_Koren Yes, I've deployed the latest snapshot on Tuesday last week. I waited a few days to see what happen. Unfortunately NPE has been thrown twice (on Saturday and today) since Tuesday.

java.lang.NullPointerException
 	at co.elastic.apm.agent.impl.transaction.Span.afterEnd(Span.java:289)
 	at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:604)
 	at co.elastic.apm.agent.impl.transaction.AbstractSpan.end(AbstractSpan.java:579)
 	at co.elastic.apm.agent.httpclient.ApacheHttpClientInstrumentation$ApacheHttpClientAdvice.onAfterExecute(ApacheHttpClientInstrumentation.java:99)

OK, then I guess it is coming from a different race condition...
I'll take a look once I get the chance.
Thanks for updating.

@barath I had a chance to look at this again this morning.
What seems weird is that you get this stack trace consistently.
If the latest snapshot is used, this means that the Span#parent field is null.
While I cannot exclude the possibility that some race condition causes this, it should have the same chances of occurring in prior lines, like Span.java:288, Span.java:268 or Span.java:266.
If you always get it at Span.java:289, I would ask you to verify that this is indeed coming from a JVM that indeed uses the latest snapshot, and not some other snapshot/version.

Please describe your environment- do you have only one JVM? If you have multiple JVMs, are they logging to the same place? Is there a possibility that this error doesn't come from an agent of the latest snapshot? If not, did you see similar stack traces, but where the top frame in the stack trace is from a different line?

@Eyal_Koren We have 3 Wildfly instances (Java 11.0.5) running in docker containers on three different hosts. They aren't logging to the same place so stack traces aren't mixed up for sure. The top frame is always the same.

Elastic APM agent version is 1.34.2-SNAPSHOT.5079263. Can you confirm this is the latest one?

Yes, this looks like the right snapshot. We'll take another look.

Hi @barath,

Would it be possible for you to provide us with a full INFO-level agent log from the startup until the error occurs in the logs? This might help our investigations.

Looks like we found the reason why the error occurs at exactly the line you described.
We updated our fix to guard against it.

Could you try this latest snapshot which contains the fix and report back whether this fixes the issue for you?

Thank you!

1 Like

Hi @Jonas_Kunz , I've deployed the latest snapshot. Let's wait a few days to see what happen. This NPE issue is very rare.

1 Like

@Jonas_Kunz I haven't got NPE since the update. It can be plain luck or the issue is resolved. Only time can tell.

Thanks for reporting back! We are confident that we solved the issue. We also added a test which was able to reproduce your specific exception when run on a build without the fix.

1 Like

That's a good news. I'm glad I could help.