Nullpointer Exception thrown at WebfluxHelper.java:306

Hello Elastic APM server and Agent

APM Server version:
8.10.4

APM Agent language and version:
java : apm-agent-attach:jar:1.46.0, 1.48.0, 1.43.0
java : apm-agent-attach:jar:1.50.0 ( yesterday's build)

Programmatic attach is happening. Things are working in many other services. And we were happy.

From last few days a couple of services in our development k8s env under aws, application developers started this at ERROR level null pointer exception. Upon DEBUG /INFO enablement, some useful info came up.

2024-05-28 06:23:44 boundedElastic-1 ERROR  - Operator called default onErrorDropped
java.lang.NullPointerException: Cannot invoke "org.apache.tomcat.util.buf.MessageBytes.equalsIgnoreCase(String)" because the return value of "org.apache.tomcat.util.http.MimeHeaders.getName(int)" is null
	at org.apache.tomcat.util.http.NamesEnumerator.findNext(MimeHeaders.java:423)
	at org.apache.tomcat.util.http.NamesEnumerator.nextElement(MimeHeaders.java:447)
	at org.apache.tomcat.util.http.NamesEnumerator.nextElement(MimeHeaders.java:405)
	at org.springframework.http.server.reactive.TomcatHeadersAdapter$EntryIterator.next(TomcatHeadersAdapter.java:211)
	at org.springframework.http.server.reactive.TomcatHeadersAdapter$EntryIterator.next(TomcatHeadersAdapter.java:200)
	at java.base/java.util.Iterator.forEachRemaining(Unknown Source)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Unknown Source)
	at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(Unknown Source)
	at java.base/java.util.stream.AbstractPipeline.evaluate(Unknown Source)
	at java.base/java.util.stream.ReferencePipeline.collect(Unknown Source)
	at org.springframework.http.ReadOnlyHttpHeaders.entrySet(ReadOnlyHttpHeaders.java:156)
	at co.elastic.apm.agent.springwebflux.WebfluxHelper.copyHeaders(WebfluxHelper.java:306)
	at co.elastic.apm.agent.springwebflux.WebfluxHelper.fillResponse(WebfluxHelper.java:296)
	at co.elastic.apm.agent.springwebflux.WebfluxHelper.endTransaction(WebfluxHelper.java:175)

Please note: this is happening in a few services out 50+ services only.

Also, this is happening in last few weeks.

Even on 1.50.0 , same null pointer exception is appearing.

Any help will be great. As of now wherever things are not working, disabling of attch-apm is being recommended.

Logs are here

Hi,

The problem seems to be an implementation bug in Tomcat implementation, the agent is just triggering it by reading the HTTP headers of the HTTP response.

As a simple work-around, you can disable the HTTP headers capture (which is on by default by setting capture_headers=false (doc), the compromise here is that the HTTP headers won't be captured for this service.

Can you try to set this configuration and tell us if that helps to work-around this issue ?

Hello Sylvain_Juge,

Your suggested workaround stopped giving the exceptions.

But as you said, we will not be able to capture headers.

Let's know the solution :). As you conveyed it is Tomcat's webserver's issue. But it would be good if we get a fix.

What we do is... we keep all environment values in configmap/secret and ask the services to 'import' these configmap and secret. This arrangement enables the operations team to control from single place.

Now, the maintenance will go up because affected services will have to maintain the capture headers as false.

Thank you very much for the quick support

@Sylvain_Juge
When you say bug in Tomcat Implementation:

My 'service is acting as rest end point' that uses tomcat.
Should I target fixing my tomcat implementation?

Any hints towards this question should help me to look for right direction for a solution.

The main problem here is that bug-fixing in Tomcat/Spring is mostly outside of our control, and even if we managed to properly get those changes merged it would then require you to wait for the next release and then update, which is likely a non-trivial burden compared to a simple configuration option.

In order to make the configuration easier though, you could use central configuration and set this capture_headers globally.

If you could provide a simple test application that allows to reproduce this outside of your codebase, then it would allow us to investigate further and maybe implement a proper bugfix in Spring/Tomcat, given that this is the first time this issue is reported it is very likely due to a very specific combination of events in the request processing or components versions and configuration.

Also, maybe trying to update to a more recent version of Tomcat and/or Spring could help here, but again without any guarantee.

1 Like

@Sylvain_Juge

Things are not working as expected when we disabled capture-headers. Still exceptions are coming

We are using k8s/Java / programmatic attach.

environment in deployment is like this:

            - name: ELASTIC_APM_CAPTURE_HEADERS
              value: 'false'

btw, we did not deploy the latest of spring

btw, this time we are getting different exception:

2024-06-07 10:03:01 boundedElastic-32 ERROR  - Operator called default onErrorDropped
java.lang.IllegalStateException: The request object has been recycled and is no longer associated with this facade
	at org.apache.catalina.connector.RequestFacade.checkFacade(RequestFacade.java:855)
	at org.apache.catalina.connector.RequestFacade.getRemoteHost(RequestFacade.java:405)
	at org.springframework.http.server.reactive.ServletServerHttpRequest.getRemoteAddress(ServletServerHttpRequest.java:192)
	at co.elastic.apm.agent.springwebflux.WebfluxHelper.fillRequest(WebfluxHelper.java:267)
	at co.elastic.apm.agent.springwebflux.WebfluxHelper.endTransaction(WebfluxHelper.java:174)
	at co.elastic.apm.agent.springwebflux.TransactionAwareSubscriber.onComplete(TransactionAwareSubscriber.java:177)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondComplete(MonoFlatMap.java:250)
	at reactor.core.publisher.MonoFlatMap$FlatMapInner.onComplete(MonoFlatMap.java:324)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondComplete(MonoFlatMap.java:250)
	at reactor.core.publisher.MonoFlatMap$FlatMapInner.onComplete(MonoFlatMap.java:324)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.onComplete(Operators.java:2231)
	at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onComplete(FluxOnAssembly.java:549)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onComplete(FluxHide.java:147)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondComplete(MonoFlatMap.java:250)
	at reactor.core.publisher.MonoFlatMap$FlatMapInner.onComplete(MonoFlatMap.java:324)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onComplete(FluxContextWrite.java:126)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onComplete(FluxHide.java:147)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondComplete(MonoFlatMap.java:250)
	at reactor.core.publisher.MonoFlatMap$FlatMapInner.onComplete(MonoFlatMap.java:324)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onComplete(FluxContextWrite.java:126)
	at co.elastic.apm.agent.reactor.TracedSubscriber.onComplete(TracedSubscriber.java:141)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onComplete(FluxHide.java:147)
	at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onComplete(MonoPeekTerminal.java:299)

After investigating Tomcat source code it seems it could be related to having one of the following configured in Tomcat:

  • security manager
  • discardFacades set on the HTTP connector or org.apache.catalina.connector.RECYCLE_FACADES=true

Here the problem is that the Weblux instrumentation is accessing a "facade" object that is getting recycled by tomcat before the instrumentation tries to access it, hence triggering the issue.

I have also found quite a few reports with similar error messages with other products, so you are probably not the first one to have this problem.

So the next thing to do here is to check if you have any those things enabled, if yes then try without them and then evaluate if they are really required or not (and by saying that I understand that security manager might not be optional)

Will study this input and come back to you.

Hi, we are experiencing the same issue. Reading this issue the origin of the problem is a change in tomcat's default behaviour.

How could this be fixed on apm agent side? I dont think we should rely on changing tomcats behaviour, they could even remove that possibility in the future.

Plus, if RECYCLE_FACADES=false couldnt then the agent be accessing data from another request? since the same request facade is reused. Instead of throwing the error like it is now maybe the agent silently reports incorrect data

We managed to fix the issue.

The problem was a transitive dependency to org.springframework.boot:spring-boot-starter-tomcat. By removing this transitive dependecy, the application started using Netty, which is the default for Webflux applications.

Its still a mistery why this error was happening only on production environment and nowhere else.

1 Like