Indexing rate decaying dramatically in a short period of time

viniciof · September 10, 2019, 2:29am

Hi everyone,

Today I launched a reindex task for a big index (4.1TB) in order to give it a more appropriate number of shards (10 -> 90).

At the beginning the rate was really good (30K/s with spikes of 45K/s), but after 9 hours it is down to around 7K/s, which is quite disappointing; at this pace it will never finish.

What kinds of factors could be involved in this dramatic decrease in indexing performance? I see the segment count has stayed quite stable for most of the time.

The yellow bar marks the time I submitted the reindex task.

I appreciate any input,

Christian_Dahlqvist · September 10, 2019, 5:50am

I believe the reindex API keeps the document ID when indexing, which means each indexing operation is in reality an update, as it needs to check whether the document already exists. An update is much slower than an indexing operation where Elasticsearch is allowed to set the ID automatically, as that can never result in an ID collision. If you have slow storage and large indices, the slowdown can be significant over time and is likely to continue deteriorating.

If you do not need to keep the document IDs, one way to speed this up might be to reindex using an ingest pipeline that removes the _id field. I believe this should be possible but have not tried it. Be aware that this could potentially lead to duplicates if the reindex process is forced to retry.

Depending on which version you are using, you may want to make sure you set up the index so you can use the split index API if needed in the future.
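
As a rough sketch of what that setup could look like: the split index API can only split into a shard count that is a multiple of the current count and a factor of `index.number_of_routing_shards`, which is fixed when the index is created. The index name `my-index-optimized` and the value 360 are assumptions for illustration only. As far as I know, 7.x derives a default for this setting from the primary shard count, while 6.x needs it set explicitly.

```console
# hypothetical index name and routing shard count, for illustration only
PUT my-index-optimized
{
  "settings": {
    "index": {
      "number_of_shards": 90,
      "number_of_routing_shards": 360
    }
  }
}
```

With these values the index could later be split to 180 or 360 primary shards.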

viniciof · September 10, 2019, 1:49pm

I will give this a try and let you know the outcome.

POST _reindex?wait_for_completion=false&slices=20
{
  "source": {
    "index": "puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest",
    "size": 5000,
    "query": {
      "range": {
        "ibi_logtime": {
          "gte": "now-9M/M"
        }
      }
    }
  },
  "dest": {
    "index": "puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-optimized",
    "pipeline": "ignore_id_pipeline"
  }
}

PUT _ingest/pipeline/ignore_id_pipeline
{
  "description" : "removes _id field from document in order to speed up reindex task (only insert, no update)",
  "processors" : [
    {
      "remove": {
        "field": "_id"
      }
    }
  ]
}
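
Because the reindex request uses `wait_for_completion=false`, it returns a task ID rather than a result. A sketch of how progress could be checked with the task management API follows; the task ID is a placeholder, not one from this thread. In the task status, the `created` and `updated` counters should show whether documents are being written as new inserts or as updates.

```console
# list running reindex tasks with their status (total, created, updated, batches)
GET _tasks?detailed=true&actions=*reindex

# look up a single task by the ID returned from the reindex request (placeholder ID)
GET _tasks/<node_id>:<task_number>
```

With `slices=20` the request runs as a parent task with one child task per slice, so the listing should show several entries.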
