Problem when reindexing large index

Juris_Rats · January 29, 2018, 9:08am

Hi,
I have a 200GB 190M document one shard index and want to reindex it to the 5 shard index. I created the new index and launched reindexing with batch size=2000. Refresh interval for the new index was set to 600s. The reindex was running for about 60h. About 180M document were reindexed and the task was aborted. The log file is below. Could anybody help?

[2018-01-29T05:12:15,722][DEBUG][o.e.a.s.TransportSearchScrollAction] [W1] [10] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [W2][10.0.0.6:9300][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [10]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:514) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:450) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.action.search.SearchTransportService$9.messageReceived(SearchTransportService.java:424) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.action.search.SearchTransportService$9.messageReceived(SearchTransportService.java:421) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1554) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:637) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.1.1.jar:6.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
[2018-01-29T05:12:15,800][WARN ][o.e.i.r.TransportReindexAction] [W1] giving up on search because it failed with a non-retryable exception
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.SearchScrollAsyncAction.onShardFailure(SearchScrollAsyncAction.java:269) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.action.search.SearchScrollAsyncAction$1.onFailure(SearchScrollAsyncAction.java:202) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:531) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1056) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.lambda$handleException$31(TcpTransport.java:1476) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport$$Lambda$1739/1330804373.run(Unknown Source) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1474) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1466) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1396) [elasticsearch-6.1.1.jar:6.1.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) [transport-netty4-6.1.1.jar:6.1.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.13.Final.jar:4.1.13.Final]
...

Mark_Harwood · January 29, 2018, 11:20am

Reindexing uses the scroll api under the covers to read a "point-in-time" view of the source data.

This point in time consists of a set of segments (Lucene files) that are essentially locked and prevented from being deleted by the usual segment merging process that works in the background to reorganise the index in response to ongoing CUD (create/update/delete) operations.

It is costly to preserve this view of data which is why users of the scroll API must renew their lock with each new request for a page of results. Locks will timeout if the client fails to return within the timespan which they said they would return.
The error you are seeing is because the reindex function has requested another page of results but the scroll ID which represents a lock on a set of files is either:

timed out (i.e the reindex client spent too long indexing the previous page) or
lost because the node serving the scroll api was restarted or otherwise became unavailable

Juris_Rats · January 29, 2018, 12:16pm

Thanks Mark for helping.
This should be the timeout as no nodes were restarted during the reindexing.
I am reindexing a development database used by nobody but me. No requests are directed to the old index while reindexing. How should I renew the lock while reindexing?

Mark_Harwood · January 29, 2018, 1:37pm

Ordinarily this occurs automatically as the reindex process cycles around fetching a batch of docs from the source then indexing into the target. It uses a generous lock timeout of 5 mins between each cycle and thus far has been sufficient for users (but a recent PR is adding support for a configurable value).
It seems odd that with a batch size of 2000 docs it took > 5 mins to complete that indexing task. Reducing the batch size would mean that each cycle would have less work to do and would therefore be more likely to complete within the 5 min window but 2k docs taking greater than 5 mins makes me suspicious of what else might have caused this failure. You don't have any "poison" docs do you (someone uploaded an encyclopedia) that might cause a blip? Any nasty queries hammering the memory? Would be worth checking the logs on the data nodes around this time to see if there were any pressures eg excessive GC.

Juris_Rats · January 29, 2018, 2:49pm

Thanks. Will explore

system · February 26, 2018, 2:49pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch Reindex Big Index Elasticsearch	5	1822	January 15, 2018
Reindexing stuck at some batch and fails with 'search context missing exception' Elasticsearch	3	1758	June 13, 2019
Issue using reindex api on 6.2.3 Elasticsearch	3	640	November 23, 2018
Reindex failing with SearchContextMissingException: No search context found for id (batch size = 1) Elasticsearch	1	538	May 20, 2020
Issue with reindexing Elasticsearch	1	375	July 6, 2017

Problem when reindexing large index

Related topics