Indexing rate decaying dramatically in a short period of time

Hi everyone,

Today I launched a reindex task for a big index (4.1TB) in order to give it a more appropriate number of shards (10 -> 90).

At the beginning it was really good (30K/s with spikes of 45K), but after 9 hours it is at around 7K/s, which is quite disappointing, as at this pace it will never finish.

What factors could be behind this dramatic decrease in indexing performance? I see the segment count has stayed quite stable most of the time.

The yellow bar marks the time I submitted the reindex task.

I'd appreciate any input.

I believe the reindex API keeps the document ID when indexing, which means each indexing operation is in reality an update, as it needs to check whether the document already exists. An update is much slower than an indexing operation where Elasticsearch is allowed to set the ID automatically, as that can never result in an ID collision. If you have slow storage and large indices, the slowdown can be significant over time and is likely to continue deteriorating.

If you do not need to keep the document IDs, one way to speed this up might be to reindex using an ingest pipeline that removes the _id field. I believe this should be possible but have not tried it. Be aware that this could potentially lead to duplicates if the reindex process is forced to retry.

Depending on which version you are using, you may want to make sure you set up the index so you can use the split index API if needed in the future.
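
To make the split constraint a bit more concrete: as far as I understand the split index API, the target shard count has to be a multiple of the source shard count and a factor of index.number_of_routing_shards. The sketch below lists the valid targets under that assumption. The function name and the value 720 for routing shards are made up for illustration and are not taken from this thread.

```python
from typing import List


def valid_split_targets(source_shards: int, routing_shards: int) -> List[int]:
    """List shard counts an index could later be split into.

    Assumes the split rule described above: the target must be a multiple
    of the source shard count and a factor of number_of_routing_shards.
    """
    if routing_shards % source_shards != 0:
        raise ValueError("routing_shards must be a multiple of source_shards")
    return [
        target
        for target in range(2 * source_shards, routing_shards + 1, source_shards)
        if routing_shards % target == 0
    ]


# 90 primary shards as in this thread; 720 routing shards is a hypothetical choice.
print(valid_split_targets(90, 720))
```

Whether number_of_routing_shards needs to be set explicitly at index creation depends on the Elasticsearch version, so check the documentation for the version in use.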

I will give this a try and let you know the outcome.

POST _reindex?wait_for_completion=false&slices=20
{
  "source": {
    "index": "puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest",
    "size": 5000,
    "query": {
      "range": {
        "ibi_logtime": {
          "gte": "now-9M/M"
        }
      }
    }
  },
  "dest": {
    "index": "puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-optimized",
    "pipeline": "ignore_id_pipeline"
  }
}

PUT _ingest/pipeline/ignore_id_pipeline
{
  "description" : "removes _id field from document in order to speed up reindex task (only insert, no update)",
  "processors" : [
    {
      "remove": {
        "field": "_id"
      }
    }
  ]
}
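
To check whether dropping _id actually helps, one option is to poll GET _tasks/<task_id> and turn the reported counters into a rate and a rough ETA. This is only a minimal sketch, assuming the usual reindex task status fields (total, created, updated, deleted) and running_time_in_nanos, and assuming that with slices the parent task reports the summed counters. The function name and the sample numbers are invented for illustration and are not measurements from this thread.

```python
def reindex_progress(task_response: dict) -> dict:
    """Compute average docs/s and a rough ETA from a GET _tasks/<id> response."""
    task = task_response["task"]
    status = task["status"]
    # Documents processed so far, whether written as new docs or as updates.
    done = status["created"] + status["updated"] + status["deleted"]
    elapsed_s = task["running_time_in_nanos"] / 1e9
    rate = done / elapsed_s if elapsed_s > 0 else 0.0
    remaining = status["total"] - done
    eta_hours = remaining / rate / 3600 if rate > 0 else float("inf")
    return {"done": done, "docs_per_s": rate, "eta_hours": eta_hours}


# Invented sample: 9 hours in, 500M of 2B documents copied.
sample = {
    "completed": False,
    "task": {
        "running_time_in_nanos": 9 * 3600 * 10**9,
        "status": {
            "total": 2_000_000_000,
            "created": 500_000_000,
            "updated": 0,
            "deleted": 0,
        },
    },
}
print(reindex_progress(sample))
```

Note that this is an average since the task started, so it lags behind a decaying rate. Comparing the counters from two consecutive polls gives the current rate instead.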
