Problem when reindexing large index

I have a 200GB 190M document one shard index and want to reindex it to the 5 shard index. I created the new index and launched reindexing with batch size=2000. Refresh interval for the new index was set to 600s. The reindex was running for about 60h. About 180M document were reindexed and the task was aborted. The log file is below. Could anybody help?

[2018-01-29T05:12:15,722][DEBUG][o.e.a.s.TransportSearchScrollAction] [W1] [10] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [W2][][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: No search context found for id [10]
at ~[elasticsearch-6.1.1.jar:6.1.1]
at ~[elasticsearch-6.1.1.jar:6.1.1]
at$9.messageReceived( ~[elasticsearch-6.1.1.jar:6.1.1]
at$9.messageReceived( ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived( ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun( ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun( ~[elasticsearch-6.1.1.jar:6.1.1]
at ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun( ~[elasticsearch-6.1.1.jar:6.1.1]
at ~[elasticsearch-6.1.1.jar:6.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$ ~[?:1.8.0_45]
at [?:1.8.0_45]
[2018-01-29T05:12:15,800][WARN ][o.e.i.r.TransportReindexAction] [W1] giving up on search because it failed with a non-retryable exception all shards failed
at [elasticsearch-6.1.1.jar:6.1.1]
at$1.onFailure( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException( [elasticsearch-6.1.1.jar:6.1.1]
at$ConnectionCountingHandler.handleException( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.lambda$handleException$31( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport$$Lambda$1739/ Source) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.handleException( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.handlerResponseError( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.messageReceived( [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead( [transport-netty4-6.1.1.jar:6.1.1]
at [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( [netty-codec-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( [netty-codec-4.1.13.Final.jar:4.1.13.Final]

Reindexing uses the scroll api under the covers to read a "point-in-time" view of the source data.

This point in time consists of a set of segments (Lucene files) that are essentially locked and prevented from being deleted by the usual segment merging process that works in the background to reorganise the index in response to ongoing CUD (create/update/delete) operations.

It is costly to preserve this view of data which is why users of the scroll API must renew their lock with each new request for a page of results. Locks will timeout if the client fails to return within the timespan which they said they would return.
The error you are seeing is because the reindex function has requested another page of results but the scroll ID which represents a lock on a set of files is either:

  1. timed out (i.e the reindex client spent too long indexing the previous page) or
  2. lost because the node serving the scroll api was restarted or otherwise became unavailable

Thanks Mark for helping.
This should be the timeout as no nodes were restarted during the reindexing.
I am reindexing a development database used by nobody but me. No requests are directed to the old index while reindexing. How should I renew the lock while reindexing?

Ordinarily this occurs automatically as the reindex process cycles around fetching a batch of docs from the source then indexing into the target. It uses a generous lock timeout of 5 mins between each cycle and thus far has been sufficient for users (but a recent PR is adding support for a configurable value).
It seems odd that with a batch size of 2000 docs it took > 5 mins to complete that indexing task. Reducing the batch size would mean that each cycle would have less work to do and would therefore be more likely to complete within the 5 min window but 2k docs taking greater than 5 mins makes me suspicious of what else might have caused this failure. You don't have any "poison" docs do you (someone uploaded an encyclopedia) that might cause a blip? Any nasty queries hammering the memory? Would be worth checking the logs on the data nodes around this time to see if there were any pressures eg excessive GC.

Thanks. Will explore

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.