Elasticsearch Reindex API - reindex only missing docs

Hi all,

We are trying to reindex large indices (about 10 million docs per index) with the following curl command:

curl -XPOST 'http://localhost:9200/_reindex?slices=5&refresh' -d '{
  "conflicts": "proceed",
  "source": {
    "index": "'.$index.'",
    "size": 10000
  },
  "dest": {
    "index": "'.$index.$version_string.'",
    "op_type": "create"
  }
}'

The reindex completed successfully for most indices.
For two indices, however, it keeps aborting.
One of the affected indices still has 500 docs left to reindex. I have retried the reindex again and again to pick up the missing docs, without success.

How can I reindex only the docs that are missing from the destination index, or how do I have to modify my reindex command so that the process runs to completion?

Sometimes the reindex process throws a "SearchContextMissingException" when it cannot resolve a scroll ID, and sometimes a data node temporarily leaves the cluster, which results in a "node_not_connected_exception".

This was my last unsuccessful try:

count v1: 8039457
count v2: 8038957

+++ COUNT IS DIFFERENT ;; Start reindexing 2016_10 for the 64 time

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   276    0   152    0   124      0      0 --:--:--  1:10:11 --:--:--     6

+++RESPONSE: {"took":4211061,"timed_out":false,"total":8039457,"updated":0,"created":0,"batches":804,"version_conflicts":8035457,"noops":0,"retries":0,"failures":[]}

For more background, here is the task list response for the reindex tasks currently running on these two indices:

{
  "nodes": {
    "LG1ycx-6STKYenLnqSMZIg": {
      "name": "client_xx",
      "transport_address": "x.x.x.x:9300",
      "host": "x.x.x.x",
      "ip": "x.x.x.x:9300",
      "attributes": {
        "rack": "xxx",
        "rack_id": "xxx",
        "data": "false",
        "master": "false"
      },
      "tasks": {
        "LG1ycx-6STKYenLnqSMZIg:5559629": {
          "node": "LG1ycx-6STKYenLnqSMZIg",
          "id": 5559629,
          "type": "transport",
          "action": "indices:data/write/reindex",
          "status": {
            "total": 5150349,
            "updated": 0,
            "created": 0,
            "deleted": 0,
            "batches": 231,
            "version_conflicts": 2310000,
            "noops": 0,
            "retries": 0
          },
          "description": "",
          "start_time_in_millis": 1492754781226,
          "running_time_in_nanos": 1222719041139
        },
        "LG1ycx-6STKYenLnqSMZIg:5551067": {
          "node": "LG1ycx-6STKYenLnqSMZIg",
          "id": 5551067,
          "type": "transport",
          "action": "indices:data/write/reindex",
          "status": {
            "total": 8039457,
            "updated": 0,
            "created": 0,
            "deleted": 0,
            "batches": 706,
            "version_conflicts": 7060000,
            "noops": 0,
            "retries": 0
          },
          "description": "",
          "start_time_in_millis": 1492752458445,
          "running_time_in_nanos": 3545465607841
        }
      }
    }
  }
}

There isn't really anything reindex can do about this: a scroll can't be resumed on another node. It is probably worth figuring out why nodes drop out of your cluster and fixing that. As a workaround, you should be able to split the reindex into smaller chunks by filtering on some field in your documents, such as a timestamp or a keyword field.
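
A rough sketch of that chunking idea, in case it helps. It only builds one `_reindex` request body per date window; sending them (for example with the same curl call as above) is left out. The field name `timestamp`, the destination name `2016_10_v2`, and the helper `chunked_reindex_bodies` are made up for illustration, and it assumes the documents carry a date field whose mapping accepts ISO dates. Each chunk opens its own, much shorter scroll, so a lost search context only costs you that one window.

```python
import json
from datetime import date, timedelta


def chunked_reindex_bodies(source_index, dest_index, field, start, end,
                           step_days=1, batch_size=1000):
    """Yield one _reindex request body per half-open date window [lo, hi).

    `field` is assumed to be a date field present in every document.
    """
    lo = start
    while lo < end:
        hi = min(lo + timedelta(days=step_days), end)
        yield {
            # Same settings as the original command: skip docs that already
            # exist in the destination instead of aborting on them.
            "conflicts": "proceed",
            "source": {
                "index": source_index,
                "size": batch_size,
                # Restrict this run to a single window of the date field.
                "query": {
                    "range": {
                        field: {"gte": lo.isoformat(), "lt": hi.isoformat()}
                    }
                },
            },
            "dest": {"index": dest_index, "op_type": "create"},
        }
        lo = hi


if __name__ == "__main__":
    # Hypothetical names: adjust the indices, field and date range to your data.
    for body in chunked_reindex_bodies("2016_10", "2016_10_v2", "timestamp",
                                       date(2016, 10, 1), date(2016, 11, 1),
                                       step_days=7):
        print(json.dumps(body))
```

If a single window fails, only that window has to be rerun, and comparing per-window counts between the two indices should also narrow down where the 500 missing docs are.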
