NullPointerException in scroll

I noticed this error today in the logs of our 5.4 cluster while running a scroll query. The same query ran with no issues before this exception and has run with no issues since, and I have not been able to reproduce it.

Has anyone seen this before? Any idea what could be causing it?

[2017-08-29T11:17:29,137][DEBUG][o.e.a.s.TransportSearchScrollAction] [data1-iil-003] [569527] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [data1-iil-003][x.x.x.x:9300][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: java.lang.NullPointerException
at org.elasticsearch.search.DefaultSearchContext.isCancelled(DefaultSearchContext.java:810) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:138) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:294) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:391) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.search.SearchTransportService$10.messageReceived(SearchTransportService.java:374) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.search.SearchTransportService$10.messageReceived(SearchTransportService.java:371) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
[2017-08-29T11:17:29,179][WARN ][r.suppressed ] path: /_search/scroll, params: {scroll=300000ms}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.SearchScrollQueryAndFetchAsyncAction.onPhaseFailure(SearchScrollQueryAndFetchAsyncAction.java:159) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.search.SearchScrollQueryAndFetchAsyncAction.access$300(SearchScrollQueryAndFetchAsyncAction.java:42) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.search.SearchScrollQueryAndFetchAsyncAction$1.onFailure(SearchScrollQueryAndFetchAsyncAction.java:146) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1041) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1145) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1123) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$7.onFailure(TransportService.java:638) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:623) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
Caused by: org.elasticsearch.transport.RemoteTransportException: [data1-iil-003][x.x.x.x:9300][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: java.lang.NullPointerException
at org.elasticsearch.search.DefaultSearchContext.isCancelled(DefaultSearchContext.java:810) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:138) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:294) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:391) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.search.SearchTransportService$10.messageReceived(SearchTransportService.java:374) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.search.SearchTransportService$10.messageReceived(SearchTransportService.java:371) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.0.jar:5.4.0]
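
For reference, the scroll in question follows the standard pattern; here is a minimal sketch using the 5.x TransportClient (the index name, page size, and match-all query are placeholders, not our actual job):

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;

    public class ScrollSketch {
        static void scrollAll(Client client) {
            // 5-minute keep-alive, matching the scroll=300000ms in the log above
            TimeValue keepAlive = TimeValue.timeValueMinutes(5);
            SearchResponse resp = client.prepareSearch("my-index")   // placeholder index
                    .setQuery(QueryBuilders.matchAllQuery())         // placeholder query
                    .setScroll(keepAlive)
                    .setSize(1000)
                    .get();
            while (resp.getHits().getHits().length > 0) {
                // ... process the current page of hits ...
                // each scroll round renews the keep-alive for another 5 minutes
                resp = client.prepareSearchScroll(resp.getScrollId())
                        .setScroll(keepAlive)
                        .get();
            }
            client.prepareClearScroll().addScrollId(resp.getScrollId()).get();
        }
    }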

This is a DEBUG-level message and you should not be worried about it.

Also, enabling DEBUG logging in production is not recommended; you should only do it when investigating specific issues.
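
If you do need it temporarily, scope it to a single logger through the cluster settings API rather than enabling DEBUG globally. Something like this with the 5.x TransportClient should work (the logger name below is taken from the o.e.a.s.TransportSearchScrollAction entry in your log):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.Settings;

    public class LoggerSettings {
        static void enableScrollDebug(Client client) {
            // transient setting: lasts until it is reset or the cluster fully restarts
            client.admin().cluster().prepareUpdateSettings()
                    .setTransientSettings(Settings.builder()
                            .put("logger.org.elasticsearch.action.search", "DEBUG")
                            .build())
                    .get();
        }
    }

You can revert it later with putNull on the same settings key.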

Cheers

We don't have the DEBUG logging level enabled, yet this error appeared in the logs anyway. Also, I would not be worried about it if it didn't cause client queries to fail with an NPE, but it does.

Well, the error is a DEBUG-level log message. There is no way for it to show up in the logs if the DEBUG level is not enabled.

But if you have a client that is receiving the error, that explains why it is only logged at DEBUG on the server.

That NPE is caused by the reference to the SearchTask being null. Maybe @Igor_Motov could shed some light here?
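
To illustrate (this is a hypothetical sketch, not the actual DefaultSearchContext source): if isCancelled() dereferences the task unconditionally, any scroll round where the SearchTask was never attached will throw exactly this NPE, whereas a null-safe check would treat a missing task as not cancelled:

    import org.elasticsearch.action.search.SearchTask;

    // Hypothetical sketch of the suspected failure mode; field and class
    // names are assumptions, not copied from the 5.4.0 source.
    final class CancellationCheck {
        private final SearchTask task; // suspected to be null for the failing scroll round

        CancellationCheck(SearchTask task) {
            this.task = task;
        }

        boolean isCancelled() {
            // "return task.isCancelled();" would NPE when task is null;
            // the null-safe variant treats a missing task as "not cancelled"
            return task != null && task.isCancelled();
        }
    }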

I saw something like this recently in https://github.com/elastic/elasticsearch/issues/25820. We are still trying to figure out what's causing it. I suspect that it might have something to do with scroll timeouts, but I am not totally sure about the concrete mechanism.

@Dan_Markhasin do you see any other messages in the log files on the other nodes around the same time? Which JVM and platform are you using to run Elasticsearch?

We are running Java 1.8.0_92 on SLES 11 SP4.

I do not know if this is related to scroll timeouts; from our logs I can see that the query that triggered this condition failed after ~2 seconds.
Also, unlike the GitHub issue, there were no other errors in the logs on any of the machines except the one I pasted above; in fact, these were the only entries in the ES log on that machine for the entire day when it happened.

I also have not been able to reproduce this. I have seen it happen twice so far, completely at random, and it has not recurred since I opened this thread.

I am still trying to figure out how this situation can occur, but I have a few more questions about your setup.

  • Do you know if this scroll query was executed against a single shard or multiple shards?

  • Were there any pre-5.4.0 nodes in the cluster when this error occurred? Are you using any plugins in this cluster?

  • Do you know if it failed on the last page of the results or somewhere in the middle?

I do not know if this is related to scroll timeouts; from our logs I can see that the query that triggered this condition failed after ~2 seconds.

Can you clarify what you meant by this? According to the stack trace, this was a scroll query, which typically consists of multiple requests. Did it take 2 seconds from the first request that initiated the scroll to this last request that failed, or did it take 2 seconds just for the last request to fail?

  1. It was executed against a specific shard
  2. No, the cluster is homogeneous (5.4.1)
  3. Unfortunately I don't know

When I said it failed after 2 seconds, I meant that 2 seconds passed between the first request that initiated the scroll and the request that failed.

Hmm, according to the stack trace, the node that failed was running 5.4.0, not 5.4.1.

Oh, sorry, you're correct. It is indeed 5.4.0. I mixed it up with a different cluster that was on 5.4.1 at the time.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.