Reindexing is failing

We are trying to reindex 2+ TB of indexed data between two clusters.
Cluster size: 6 nodes
Reindexing command:

curl -XPOST "http://localhost:9200/_reindex?scroll=5m&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://elasticsearch-xyz-cluster-master:9200",
      "socket_timeout": "5m",
      "connect_timeout": "5m"
    },
    "index": "xyz_index"
  },
  "dest": {
    "index": "xyz_index_2"
  },
  "conflicts": "proceed"
}'

We are hitting rejections here and the reindex is failing partway through. Can we adjust any cluster settings to get this to complete?

[2021-02-24T17:40:52,810][INFO ][o.e.t.LoggingTaskListener] [elasticsearch-master-0] 5328 finished with response BulkByScrollResponse[took=6.1h, timed_out=false, sliceId=null, updated=0, created=178506000, deleted=0, batches=178506, versionConflicts=0, noops=0, retries=0, throttledUntil=0s, bulk_failures=[], search_failures=[
  {"shard":-1,"status":429,"reason":{"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@588a1b02 on QueueResizingEsThreadPoolExecutor[name = elasticsearch-master-4/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 512.4ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@78c3882f[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 4424686]]"}},
  (two more es_rejected_execution_exception entries, identical apart from the TimedRunnable object IDs, all for the elasticsearch-master-4/search thread pool)
]]
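The 429 rejections above come from the search thread-pool queue on the source cluster filling up while the reindex pulls scroll batches. One mitigation is to slow the reindex down rather than resize the pool. A hedged sketch, reusing the command from above; the `requests_per_second` and `size` values here are illustrative guesses, not tuned recommendations:

```
# Throttle the reindex and shrink the scroll batch size so the source
# cluster's search queue is hit more gently. "requests_per_second" and
# "source.size" are standard _reindex parameters; the values are examples.
curl -XPOST "http://localhost:9200/_reindex?wait_for_completion=false&requests_per_second=500" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://elasticsearch-xyz-cluster-master:9200",
      "socket_timeout": "5m",
      "connect_timeout": "5m"
    },
    "index": "xyz_index",
    "size": 500
  },
  "dest": {
    "index": "xyz_index_2"
  },
  "conflicts": "proceed"
}'

# An already-running reindex task can also be rethrottled in place
# (substitute the real task id returned by the reindex call):
curl -XPOST "http://localhost:9200/_reindex/<task_id>/_rethrottle?requests_per_second=500"
```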

I tried increasing the CPU count from 8 to 16, but it made no difference.

Is that a single 2TB index?

Yes @warkolm

You could take a look at a short how-to I wrote. Migrating your index with an external tool like Logstash might provide a little more resiliency.

I wrote it to migrate data from on-prem to cloud, but it works between two local clusters as well.

TL;DR: use Logstash.

Good luck. We have never been able to get a similarly large reindex job to complete using the native Elasticsearch APIs. We have a few multi-terabyte indexes as well, and working with them can be quite difficult. If you have an opportunity to redesign your indexes, I would highly recommend splitting them into smaller indexes and grouping them with an alias.
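For illustration, smaller indexes can still be queried under a single name by grouping them with an alias. A minimal sketch; the index and alias names here are hypothetical:

```
# Hypothetical example: add two smaller indexes to one alias so searches
# can target "xyz_index_all" instead of each index individually.
curl -XPOST "http://localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "xyz_index_2021_01", "alias": "xyz_index_all" } },
    { "add": { "index": "xyz_index_2021_02", "alias": "xyz_index_all" } }
  ]
}'
```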

On the specific issue of a reindex never completing on a large index in a production cluster, there is a thread I've been tracking for years:

Per @stephenb's suggestion, I wouldn't even waste time with elasticdump; it is incredibly slow, especially at the data sizes you are looking at. We have looked at using Logstash several times, but the inconsistencies in how the elasticsearch input and output plugins have to be configured are frustrating, and I believe an issue with missing options on either the input or output cluster prevented us from moving forward with it. That said, if you can get Logstash to connect to both of your clusters, it is probably your best bet.
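For reference, the Logstash approach usually amounts to a short pipeline like the sketch below. The hosts and index names are assumptions carried over from the curl command earlier in the thread, and the exact plugin options available do vary by version, which is part of the inconsistency complaint above:

```
# Minimal Logstash pipeline sketch (hosts and index names are assumptions).
# The elasticsearch input pages through the source index with a scroll,
# and the elasticsearch output bulk-indexes into the destination.
input {
  elasticsearch {
    hosts   => ["http://elasticsearch-xyz-cluster-master:9200"]
    index   => "xyz_index"
    docinfo => true   # expose _id so we can preserve document IDs
  }
}
output {
  elasticsearch {
    hosts       => ["http://localhost:9200"]
    index       => "xyz_index_2"
    document_id => "%{[@metadata][_id]}"
  }
}
```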

We don't have Logstash set up in our production environment. Are there any other ways, by adjusting Elasticsearch settings, to get this reindex to go through?

I think this may be helpful: Thread pools | Elasticsearch Reference [master] | Elastic
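Along those lines, it can help to first confirm where the rejections are happening. A sketch of how to watch the search thread pool; the `thread_pool.search.queue_size` value below is illustrative, and enlarging the queue treats the symptom rather than the overload itself:

```
# Show per-node search thread-pool pressure. The "rejected" column should
# climb on the node(s) producing the 429s seen in the reindex response.
curl "http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected"

# The search queue size is a static node setting (elasticsearch.yml on each
# node, restart required), e.g.:
#   thread_pool.search.queue_size: 2000
```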

Please suggest.

@mouli_v ,

Not in my experience. I have a couple of "smaller" indexes in the 600 GB to 800 GB range, and reindexing those is inconsistent. Our production cluster must be very lightly loaded for them to succeed.

Another suggestion to throw out there would be standing up a clone of your production cluster, snapshotting your production index, loading it into the clone, and then doing the reindex on that cluster, which has no other demands on it. Then you can snapshot from the clone and restore into production. Personally, I'd rather stand up a Logstash instance than do all that, but it is another option that might work for you.
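A hedged sketch of that round-trip, assuming both clusters are registered to a shared snapshot repository; the repository, snapshot, host, and index names are all illustrative:

```
# 1. Snapshot the index on the production cluster.
curl -XPUT "http://prod-cluster:9200/_snapshot/my_repo/snap_1?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{ "indices": "xyz_index" }'

# 2. Restore it on the clone cluster (registered to the same repository).
curl -XPOST "http://clone-cluster:9200/_snapshot/my_repo/snap_1/_restore" -H 'Content-Type: application/json' -d'
{ "indices": "xyz_index" }'

# 3. Reindex on the clone at leisure, snapshot the result, and restore
#    the new index back into production the same way.
```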

While doing the reindex I would also suggest that you try to use ILM to create multiple smaller indexes to help avoid this problem in the future. That is our plan for managing some of our largest indexes when we move forward with a big upgrade in the coming months.
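As a sketch of what that could look like, an ILM policy can roll over to a fresh backing index before any one index gets unmanageably large. The policy name and thresholds below are examples, not recommendations:

```
# Illustrative ILM policy: roll over to a new index at 50 GB or 30 days,
# so no single index grows into the multi-terabyte range.
curl -XPUT "http://localhost:9200/_ilm/policy/xyz_rollover_policy" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      }
    }
  }
}'
```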

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.