Reindexing gets stuck at some batch and fails with 'search context missing exception'

#1

I'm reindexing all my indices from an old cluster (5.x.x) to a new cluster (7.0.0).

I have some big indices, around 3GB each with about 50k documents. For these I had to specify 'size' as low as 10 to reindex.
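
For context, a reindex-from-remote request with a small batch size looks roughly like the sketch below. The host URL and index names are placeholders, and the two timeout values are only examples, not settings taken from this cluster. `source.size`, `remote.socket_timeout` and `remote.connect_timeout` are fields of the `_reindex` request body.

```python
import json


def build_remote_reindex_body(remote_host, source_index, dest_index, batch_size=10):
    """Build a _reindex request body that pulls documents from a remote cluster."""
    return {
        "source": {
            "remote": {
                "host": remote_host,        # placeholder URL of the old cluster
                "socket_timeout": "5m",     # example value, not from the thread
                "connect_timeout": "30s",   # example value, not from the thread
            },
            "index": source_index,
            "size": batch_size,             # documents fetched per scroll batch
            "query": {"match_all": {}},
        },
        "dest": {"index": dest_index},
    }


body = build_remote_reindex_body(
    "http://old-cluster.example:9200",  # hypothetical host
    "big_index",                        # hypothetical index names
    "big_index",
)
print(json.dumps(body, indent=2))
```

Sending this body to `POST _reindex?wait_for_completion=false` on the new cluster returns a task ID that can be polled with the tasks API. The remote host also has to be listed in `reindex.remote.whitelist` on the new cluster.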

But while reindexing these big indices, the reindex gets stuck after some time (I'm monitoring progress via the task ID).

{
  "completed" : false,
  "task" : {
    "node" : "eRSnrCA-QQ6v0CTfQzwSGw",
    "id" : 4403759,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 49449,
      "updated" : 37170,
      "created" : 0,
      "deleted" : 0,
      "batches" : 3718,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled" : "0s",
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until" : "0s",
      "throttled_until_millis" : 0
    },
    "description" : "reindex from [host= port=9200 query={\n  \"match_all\" : {\n    \"boost\" : 1.0\n  }\n}][] to [][_doc]",
    "start_time" : "2019-05-16T15:22:48.011Z",
    "start_time_in_millis" : 1558020168011,
    "running_time" : "1h",
    "running_time_in_nanos" : 3813534788227,
    "cancellable" : true,
    "headers" : { }
  }
}

The 'batches' and 'updated' values have stayed the same for more than half an hour.
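
A small sketch of how two polls of the task status could be compared to confirm that nothing is moving. The `status` dict is copied from the task output above; the helper names `progress` and `is_stalled`, and the choice of which counters count as "processed", are my own assumptions for illustration.

```python
def progress(status):
    """Return (docs_processed, fraction_done) from a reindex task status dict."""
    processed = (
        status["created"] + status["updated"] + status["deleted"] + status["noops"]
    )
    total = status["total"]
    return processed, (processed / total if total else 0.0)


def is_stalled(earlier, later):
    """True if none of the progress counters changed between two polls."""
    keys = ("created", "updated", "deleted", "noops", "batches")
    return all(earlier[k] == later[k] for k in keys)


# Counters taken from the task status shown above.
status = {
    "total": 49449,
    "updated": 37170,
    "created": 0,
    "deleted": 0,
    "batches": 3718,
    "version_conflicts": 0,
    "noops": 0,
}

done, fraction = progress(status)
print(f"{done}/{status['total']} documents processed ({fraction:.1%})")
print("stalled:", is_stalled(status, dict(status)))
```

In practice `earlier` and `later` would be the `task.status` objects from two `GET _tasks/<task_id>` calls made a few minutes apart.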

Thanks,
Sanjay

#2

In the Elasticsearch log I see:

{"log":"failing shard [failed shard, shard [index][0], node[p1xD7RV5QGa0CWCmsvDS-Q], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=wYmxVF46R22vjeqN0XiBMQ], unassigned_info[[reason=NODE_LEFT], at[2019-05-16T15:37:12.558Z], delayed=false, details[node_left [p1xD7RV5QGa0CWCmsvDS-Q]], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[index][0]: Recovery failed from {hostt}{eRSnrCA-QQ6v0CTfQzwSGw}{lYIG3CWtRe6t1s6qvISXVA}{}{} into {nodet}{p1xD7RV5QGa0CWCmsvDS-Q}{wdWj_4sPSmCyzBNAEJhAnw}{}{}]; nested: RemoteTransportException[[master][:9300][internal:index/shard/recovery/start_recovery]]; nested: ReceiveTimeoutTransportException[[nodet][:9300][internal:index/shard/recovery/translog_ops] request_id [4152318] timed out after [1799903ms]]; ], markAsStale [true]]\n","stream":"stdout","time":"2019-05-16T16:10:44.025627354Z"}

From this, it seems the index shard failed because a node left during the reindexing and then rejoined.

I see multiple instances where nodes leave and rejoin. I couldn't figure out why this was happening.
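
To get an overview of how often each node drops out, the log can be scanned for NODE_LEFT entries. The sketch below assumes the log is in the Docker JSON format of the line quoted above (one JSON object per line with `log`, `stream` and `time` keys). The function name and regex are my own, and the sample line is a shortened copy of the quoted one, with `...` marking the parts left out.

```python
import json
import re
from collections import defaultdict

# Captures the "at[...]" timestamp and the node id from a NODE_LEFT entry.
NODE_LEFT = re.compile(
    r"reason=NODE_LEFT\], at\[([^\]]+)\].*?node_left \[([^\]]+)\]"
)


def node_left_events(lines):
    """Map node id -> sorted, de-duplicated timestamps at which it was seen leaving."""
    events = defaultdict(set)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a Docker JSON log record
        if not isinstance(record, dict):
            continue
        match = NODE_LEFT.search(record.get("log", ""))
        if match:
            left_at, node_id = match.groups()
            events[node_id].add(left_at)
    return {node: sorted(times) for node, times in events.items()}


# Shortened copy of the log line above; "..." marks omitted text.
sample_line = json.dumps({
    "log": "failing shard [... unassigned_info[[reason=NODE_LEFT], "
           "at[2019-05-16T15:37:12.558Z], delayed=false, "
           "details[node_left [p1xD7RV5QGa0CWCmsvDS-Q]], ...]\n",
    "stream": "stdout",
    "time": "2019-05-16T16:10:44.025627354Z",
})

print(node_left_events([sample_line]))
```

Running this over the full log file (for example `node_left_events(open(path))`, where `path` is wherever the container log lives) should show whether it is always the same node that leaves.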

#3

Reindexing failed with this error:

  "error" : {
    "type" : "status_exception",
    "reason" : "body={\"error\":{\"root_cause\":[{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [15211]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [62274]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [68065]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16491]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16492]\"}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [15211]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [62274]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [68065]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16491]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16492]\"}}],\"caused_by\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16492]\"}},\"status\":404}",
    "caused_by" : {
      "type" : "response_exception",
      "reason" : "method [POST], host [:9200], URI [/_search/scroll?scroll=300000000000nanos], status line [HTTP/1.1 404 Not Found]\n{\"error\":{\"root_cause\":[{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [15211]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [62274]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [68065]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16491]\"},{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16492]\"}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [15211]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [62274]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [68065]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16491]\"}},{\"shard\":-1,\"index\":null,\"reason\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16492]\"}}],\"caused_by\":{\"type\":\"search_context_missing_exception\",\"reason\":\"No search context found for id [16492]\"}},\"status\":404}"
    }
  }
}
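
One possible reading of this error, not confirmed anywhere in this thread: the failing request uses `scroll=300000000000nanos`, and a scroll search context on the source cluster is dropped once its keep-alive passes without a new scroll request. The sketch below converts that value and compares it with the `1799903ms` recovery timeout from the log in #2. Whether the reindex was actually blocked for that long is an assumption, and the function name is hypothetical.

```python
import re


def scroll_keepalive_seconds(uri):
    """Extract the scroll keep-alive, in seconds, from a scroll request URI."""
    match = re.search(r"[?&]scroll=(\d+)nanos", uri)
    if match is None:
        raise ValueError(f"no 'scroll=<n>nanos' parameter in {uri!r}")
    return int(match.group(1)) / 1e9  # nanoseconds -> seconds


# URI taken from the error above.
keepalive_s = scroll_keepalive_seconds("/_search/scroll?scroll=300000000000nanos")

# Recovery timeout taken from the log line in #2, converted from ms to s.
recovery_timeout_s = 1799903 / 1000

print(f"scroll keep-alive: {keepalive_s / 60:.0f} min")
print(f"recovery timeout:  {recovery_timeout_s / 60:.0f} min")
print("stall longer than keep-alive:", recovery_timeout_s > keepalive_s)
```

If that reading is right, any pause between two batches longer than the keep-alive would leave the next scroll request with no search context to resume from, which would match the 404 above.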