Reindex API performance

Hi, checking reindex api. Have 400GB index with 6 shards to play with and copying over with a changed template to have 20 shards.

Got 16 nodes in a cluster and noticed with index set to 16 shards elastic was not distributing shards evenly across the whole cluster, whereas with 20 it somehow managed to do distribute shards much more evenly.

Problem:
I don't exactly understand why am I getting speed around 1.5k docs/second when using Reindex API, whereas when I do read and write to the same cluster with logstash I can make it spinning at 4k docs/second.

Not sure if my cluster is a bottleneck. For instance when creating replicas on the mentioned 6 shards index it took only 1hour to produce 400GB new replicas. So I'd think my hw is kinda okayish here.

Cheers
Marcin

Reindex uses a relatively simplistic pull from the scroll, index in bulk process. It doesn't attempt to run those concurrently or parallelize the process.

You can usually speed up reindex in a few ways:

  1. Use a bigger batch size. Like this:
{
  "source": {
    "index": "foo",
    "size": 5000 <---- batch size
  },
  "dest": {
    "index": "bar"
  }
}
  1. Issuing multiple reindexes in parallel. Use a query in the "source" that slices that data somehow and just run N of them.

The reindex in 2.3 has a too-small default batch size so that is where I'd start.

I'll be investigating adding concurrent scroll processing to reindex for 5.0 which ought to speed it up quite a bit.

1 Like

Cool! Waiting for es5 then!