Reindexing many indexes failing. Please help!

Hi, I am trying to reindex many indices into one ILM-based alias/index...

However, I get the error below. It's a smallish DB at 150GB, but with many indices, around 3000 (hence the restructuring).

First it complained about too many scroll contexts, so I increased the scroll context limit... now I get the error below.
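For context, the scroll context limit is the search.max_open_scroll_context cluster setting (default 500); the change I made was along these lines (the value here is just an example):

PUT _cluster/settings
{
  "persistent": {
    "search.max_open_scroll_context": 2000
  }
}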

This is running on a c6gd.2xlarge AWS EC2 instance, with an NVMe drive to speed it up.
Elasticsearch version 8.18.

My request is as follows

POST _reindex?wait_for_completion=false&slices=auto
{
  "source": {
    "index": "dataset-*"
  },
  "dest": {
    "index": "measurement-data",
    "op_type": "create"
  },
  "script": {
    "inline": "ctx._source.dataseriesId = ctx._source.remove(\"id\"); ctx._source.datasetId = ctx._index.substring(8);",
    "lang": "painless"
  }
}

TASK status with errors at the bottom

{
  "completed": true,
  "task": {
    "node": "91hv2jovSRqTYMy6Z2HmLw",
    "id": 32814079,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 2611898649,
      "updated": 0,
      "created": 3000,
      "deleted": 0,
      "batches": 3,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0,
      "slices": [
        {
          "slice_id": 0,
          "total": 1188375375,
          "updated": 0,
          "created": 2000,
          "deleted": 0,
          "batches": 2,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 1,
          "total": 1423523274,
          "updated": 0,
          "created": 1000,
          "deleted": 0,
          "batches": 1,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        }
      ]
    },
    "description": "reindex from [dataset-*] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.dataseriesId = ctx._source.remove(\"id\"); ctx._source.datasetId = ctx._index.substring(8);', options={}, params={}} to [measurement-data]",
    "start_time_in_millis": 1748518640325,
    "running_time_in_nanos": 979225413,
    "cancellable": true,
    "cancelled": false,
    "headers": {}
  },
  "response": {
    "took": 957,
    "timed_out": false,
    "total": 2611898649,
    "updated": 0,
    "created": 3000,
    "deleted": 0,
    "batches": 3,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled": "0s",
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until": "0s",
    "throttled_until_millis": 0,
    "slices": [
      {
        "slice_id": 0,
        "total": 1188375375,
        "updated": 0,
        "created": 2000,
        "deleted": 0,
        "batches": 2,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 1,
        "total": 1423523274,
        "updated": 0,
        "created": 1000,
        "deleted": 0,
        "batches": 1,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      }
    ],
    "failures": [
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x00000000744f2c48@6698fec7], creationTimeNanos=167646586603173, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name =[NAME]/search, queue capacity = 1000, task execution EWMA = 1ms, total task execution time = 9.2m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 30, active threads = 30, queued tasks = 992, completed tasks = 1697334]]"
        }
      },
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x00000000744f2c48@3d680ccf], creationTimeNanos=167646587724683, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = [NAME]/search, queue capacity = 1000, task execution EWMA = 1.4ms, total task execution time = 9.2m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 30, active threads = 29, queued tasks = 999, completed tasks = 1697355]]"
        }
      },
      ... (many more failures removed)
    ]
  }
}

Any help/suggestions appreciated to get around this issue. Also, anything that could make this reindex process more optimised would be welcome. I think it would currently take a good few hours to complete; ideally I'd be able to speed it up...

Just a suggestion: write a short script and loop over the required indices one at a time.

Reasoning is you only have to do it once, right?
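To be clear, I mean something as simple as issuing one reindex per source index and waiting for it to finish before starting the next. A rough sketch of the per-index call (the index name is a placeholder, the script is copied from your request):

POST _reindex?wait_for_completion=true
{
  "source": {
    "index": "dataset-<one-index>"
  },
  "dest": {
    "index": "measurement-data",
    "op_type": "create"
  },
  "script": {
    "source": "ctx._source.dataseriesId = ctx._source.remove(\"id\"); ctx._source.datasetId = ctx._index.substring(8);",
    "lang": "painless"
  }
}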

> I think it would currently take a good few hours to complete

So? That's a serious question: what would be the problem with that?


That is probably the best way to do it. If you are indexing into a set of indices or a data stream that relies on rollover, you generally want to index the data in at least approximate timestamp order, as that will make it age out in the expected order. If you just specify all indices, you do not control the order, and timestamp order will not be maintained.

Hi both, thanks for your replies...

@RainTown this is a live production database, so I'd like to minimise downtime. I've had to introduce a breaking change, requiring the restructuring of the DB. So I need to stop the inflow of data while I am reindexing... that is unwelcome and should be minimised.

@RainTown @Christian_Dahlqvist A script is a possible way forward. I guess I'd need to monitor the tasks so that I don't overload the server, i.e. only start a few at a time? Also, it sounds like I should be querying my data in time order... as each of the 3000-odd indices has its own data over time, i.e. each index is not a particular period, but rather a particular device.

Am I correct that the failure reported above relates to the server being overloaded? Is there a way to sensibly throttle it?

i.e. what is the recommended approach to managing load so as not to get the same error by running too many at the same time? I'd have thought Elasticsearch would manage its own load and not fall over.

If I was in your situation I would probably set up a new rollover index or data stream and direct all new data there. That will ensure all new data is indexed in timestamp order. You should then be able to deal with the existing indices in the cluster without downtime.
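As a rough sketch (all names here are illustrative), the new writes could go to a data stream declared through an index template along these lines, with rollover and retention handled by its ILM policy:

PUT _index_template/measurement-data
{
  "index_patterns": ["measurement-data"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "measurement-data-policy"
    }
  }
}

New documents would then simply be indexed into measurement-data (note that documents in a data stream need an @timestamp field).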

If you then want to reduce the number of shards and reindex the old data, you could index one or a couple of old indices at a time into traditional time-based indices that do not use rollover but have the date they cover in the name. You should be able to create an ingest pipeline that determines the correct index name based on the timestamp. This would best be done using a script. Note that you may get duplicates while reindexing occurs before the old data is deleted.
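One way to do the index-name routing (a sketch, assuming the events have an @timestamp field; the pipeline name and prefix are made up) is the built-in date_index_name processor:

PUT _ingest/pipeline/route-by-month
{
  "processors": [
    {
      "date_index_name": {
        "field": "@timestamp",
        "index_name_prefix": "measurement-data-",
        "date_rounding": "M",
        "date_formats": ["ISO8601"],
        "index_name_format": "yyyy-MM"
      }
    }
  ]
}

With monthly rounding, a document stamped anywhere in July 2024 would be routed to measurement-data-2024-07.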

You can then create a separate ILM policy for these indices and have them aged out based on the index names.

We all (usually) want to minimize downtime, but from what you posted your overall plan is not clear - that's why we ask questions! Remember you know wayyyy more than we do about your situation, data, and use cases, and what you are trying to do, and why. And, IMO you don't really have a database, not as that term is usually understood. I know that's semantics, but... it is also not a particularly important distinction here, admittedly.

How many of those 3000 indices are being actively written to? Are they even time-based in any way, e.g. dataset-2025-05-29, -28, -27, -26, ... or something similar? Or do we have dataset-red, dataset-green, dataset-London, etc.? Are there document updates and/or deletes going on in many of these indices, or are most of the dataset-* indices effectively frozen / read-only?

My first guess was that you likely don't have too many indices being actively written to, maybe as few as one (or even zero, we can't know), so you can "process" all the others, without any downtime, slowly even, to the point where you would be 90-something percent done in the background. Then block things, process the last few, change what you need to for the new index name(s), check, check again, start things flowing again, and check once more.

From what you shared, it didn't fall over. It probably saved itself from falling over.

Hi both, thanks again for your responses.

I have created a script to run this, which does it month by month. I still end up getting the above error. I then reduced it to day by day... and that still ends up with the same error.

It seems like the individual reindex processes run and complete OK, and then after a whole bunch of them have run, they start failing (after about 100M docs reindexed OK). The failed ones have only processed and created a subset of the total docs.

e.g. task output is listed further down

It does not seem to matter how I split up the reindexing process; it still gets to a point where it stops processing them, with the same error.

So there must be some queue/buffer/pool etc. that is getting overloaded. Without knowing what it is, I can't "throttle" the reindexing accordingly or manage the process more appropriately.

Are you able to shed light on the error and what is causing that, or where I should look?

{
  "completed": true,
  "task": {
    "node": "91hv2jovSRqTYMy6Z2HmLw",
    "id": 19838950,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": **485845**,
      "updated": 0,
      "created": **241538**,
      "deleted": 0,
      "batches": 242,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0,
      "slices": [
        {
          "slice_id": 0,
          "total": 256307,
          "updated": 0,
          "created": 12000,
          "deleted": 0,
          "batches": 12,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 1,
          "total": 229538,
          "updated": 0,
          "created": 229538,
          "deleted": 0,
          "batches": 230,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        }
      ]
    },
    "description": "reindex from [dataset-*] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.dataseriesId = ctx._source.remove(\"id\"); ctx._source.datasetId = ctx._index.substring(8);', options={}, params={}} to [measurement-data]",
    "start_time_in_millis": 1748607579641,
    "running_time_in_nanos": 10428953489,
    "cancellable": true,
    "cancelled": false,
    "headers": {}
  },
  "response": {
    "took": 10395,
    "timed_out": false,
    "total": 485845,
    "updated": 0,
    "created": 241538,
    "deleted": 0,
    "batches": 242,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled": "0s",
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until": "0s",
    "throttled_until_millis": 0,
    "slices": [
      {
        "slice_id": 0,
        "total": 256307,
        "updated": 0,
        "created": 12000,
        "deleted": 0,
        "batches": 12,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 1,
        "total": 229538,
        "updated": 0,
        "created": 229538,
        "deleted": 0,
        "batches": 230,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      }
    ],
    "failures": [
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x0000000075577b18@2b2e6e04], creationTimeNanos=256586081264061, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = .eu-west-1.compute.internal/search, queue capacity = 1000, task execution EWMA = 132.7micros, total task execution time = 59.9m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 8, active threads = 8, queued tasks = 996, completed tasks = 15079963]]"
        }
      },
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x0000000075577b18@76fb38fd], creationTimeNanos=256586081617727, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = eu-west-1.compute.internal/search, queue capacity = 1000, task execution EWMA = 169.8micros, total task execution time = 59.9m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 8, active threads = 8, queued tasks = 1000, completed tasks = 15079973]]"
        }
      },
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x0000000075577b18@56e8baf], creationTimeNanos=256586081825979, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = .eu-west-1.compute.internal/search, queue capacity = 1000, task execution EWMA = 154.8micros, total task execution time = 59.9m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 8, active threads = 8, queued tasks = 998, completed tasks = 15079979]]"
        }
      },
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x0000000075577b18@2b57f01], creationTimeNanos=256586081962532, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = .eu-west-1.compute.internal/search, queue capacity = 1000, task execution EWMA = 122.3micros, total task execution time = 59.9m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 8, active threads = 8, queued tasks = 1000, completed tasks = 15079985]]"
        }
      },
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x0000000075577b18@7d1ae1df], creationTimeNanos=256586082086545, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = .eu-west-1.compute.internal/search, queue capacity = 1000, task execution EWMA = 89.1micros, total task execution time = 59.9m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 8, active threads = 8, queued tasks = 999, completed tasks = 15079993]]"
        }
      },
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x0000000075577b18@3bbe4739], creationTimeNanos=256586082206195, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = .eu-west-1.compute.internal/search, queue capacity = 1000, task execution EWMA = 80.6micros, total task execution time = 59.9m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 8, active threads = 8, queued tasks = 1000, completed tasks = 15079998]]"
        }
      },
      {
        "shard": -1,
        "status": 429,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda/0x0000000075577b18@509361de], creationTimeNanos=256586082311388, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = .eu-west-1.compute.internal/search, queue capacity = 1000, task execution EWMA = 106.6micros, total task execution time = 59.9m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4b158e16[Running, pool size = 8, active threads = 8, queued tasks = 999, completed tasks = 15080000]]"
        }
      }
    ]
  }
}

And is it the same point every time, or not?

Are the documents pretty much all the same format, or did that evolve over time, getting more complex, or is it just more docs/day?

In the error you can see queued tasks = 1000 (or values thereabouts). That 1000 is the search thread pool queue capacity, and hitting it is not good. Is significant garbage collection going on? What is the stack / any monitoring telling you about the host? c6gd.2xlarge sounds pretty beefy, but maybe not beefy enough, and if I read the spec right it's "only" 16GB. How much heap are you using?
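If it helps, these are the sort of standard requests I would use to watch the search queue and the heap while the reindex runs:

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed

GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors

A steadily growing rejected count and a queue pinned near 1000 would confirm it's the search thread pool, and heap_used_percent plus the GC collector timings will tell you whether memory pressure is part of it.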

BUT, we are probably now past my limit of useful knowledge here. Maybe others can help more. Good luck.

I am not sure I understand what you are doing, but it does not seem to be what I suggested. Are you running a reindex against all old indices with a filter clause on the timestamp, and then redirecting it to a single index?

What I suggested was to create a script that processes one old index at a time and have it write to a number of time-based indices based on the @timestamp field in the events. You should be able to set the index name correctly through an ingest pipeline. This approach should cause a lot less load, and I would be surprised if it fails.
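In other words, per old index, something roughly like this (the index and pipeline names are placeholders; a date_index_name-style pipeline overrides the destination index per document, and you would keep your existing painless script in the request as well):

POST _reindex
{
  "source": {
    "index": "dataset-<one-old-index>"
  },
  "dest": {
    "index": "measurement-data-fallback",
    "op_type": "create",
    "pipeline": "route-by-month"
  }
}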

This is an indication that you are overloading the cluster.
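Rather than relying on 429s, you can also cap the rate of the reindex itself with requests_per_second, and adjust a running task with the rethrottle API. The values here are only illustrative:

POST _reindex?wait_for_completion=false&requests_per_second=500
{
  "source": {
    "index": "dataset-<one-index>"
  },
  "dest": {
    "index": "measurement-data",
    "op_type": "create"
  }
}

POST _reindex/<task_id>/_rethrottle?requests_per_second=250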