Cluster [271f47] Reindex API questions

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

Can you write into the target index as soon as the reindex starts? I'm planning to write new data into both the source and target index while the reindex is running. That way, when the reindex is done, I can just change the aliases for my live system. Also, is the throughput for this faster than bulk indexing? Any idea how much faster? We are on a Heroku Boxer Plan (4GB RAM).

So I tried this on my local box with a copy of our prod DB. Does the reindex start from the documents with the highest _id? Because when I queried it looked like the target index did not have the documents with the lowest _id.

Also, it looks like it made 5 shards for the target index. Is this configurable? I only have one shard for the source index.

Yep! Reindex is really just a bunch of Bulk requests which get sent in the background. The target index is just like any other index, so you can index/update/get/delete/search to it.
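For the alias switch mentioned in the question, here is a rough Python sketch of the request body for the aliases endpoint, which applies all actions in one request. The alias name "live" is a made-up placeholder, and "source" and "dest" are just the index names used in the example elsewhere in this thread. It only builds and prints the body; sending it is left to whatever client you use.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            # Both actions are sent together, so the alias moves in a single request.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # "live" is a hypothetical alias name, not something from this thread.
    body = alias_swap_actions("live", "source", "dest")
    print("POST /_aliases")
    print(json.dumps(body, indent=2))
```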

It may be slightly faster than bulk indexing from an external client, since all the traffic stays local to the cluster. But not by much... the majority of the work is the actual indexing, which stays the same regardless.

The default reindex will be in basically random order. Internally, ES is just grabbing segments sequentially, and since docs get mixed up during the merge process, the end result is effectively random ordering.

If you need a certain order, you can specify a sort field (search for "sort" on the Reindex docs page for the syntax)
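As a rough illustration of the sort option, this Python sketch builds a reindex body with a sort inside "source". The field name "created_at" is an assumption for the example. As far as I know, later Elasticsearch releases deprecate sort in reindex, so check the docs for the version you are running before relying on it.

```python
import json


def sorted_reindex_body(source_index, dest_index, sort_field, order="desc"):
    """Build a POST /_reindex body that asks for documents in sort_field order."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "source": {
            "index": source_index,
            # Sorting is slower than the default _doc order.
            "sort": {sort_field: order},
        },
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    # "created_at" is a hypothetical field name.
    print("POST /_reindex")
    print(json.dumps(sorted_reindex_body("source", "dest", "created_at"), indent=2))
```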

If you reindex into a target that doesn't exist, that index is created according to the defaults (5 shards). You can override these defaults with Index Templates. Or you can just create the index manually with the required settings, then reindex into that.
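Creating the target by hand first could look roughly like the sketch below, which builds the body for a create-index request with a single shard to match the source. The index name "dest" is taken from the example elsewhere in this thread. Any mappings you need would go into the same request.

```python
import json


def create_index_body(number_of_shards):
    """Build a body for PUT /<index> that fixes the shard count up front."""
    if number_of_shards < 1:
        raise ValueError("number_of_shards must be at least 1")
    return {"settings": {"index": {"number_of_shards": number_of_shards}}}


if __name__ == "__main__":
    # Create the target before starting the reindex so the defaults are not applied.
    print("PUT /dest")
    print(json.dumps(create_index_body(1), indent=2))
```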

Hey! You put a cluster ID in the title so I thought this was a cloud question and didn't read it too closely. Maybe don't do that if it is a general question? I think you'll get more responses. Anyway,

Sure. Reindex acts just like scrolling from one index, collecting the documents, and dumping into the next index. It does them in batches.

If you are fairly sure that you aren't going to have any collisions, you can and should write to the index while the reindex is running. If you think you might have collisions, you should look into using "version_type": "external" and "conflicts": "proceed". The default behavior of reindex is to overwrite any document in the target index that has the same ID as the one being written. Setting those two parameters is more like saying "only copy if there is something new". Kind of. It is worth reading about and experimenting with if you have time.
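A minimal sketch of where those two parameters go in the request body: "conflicts" sits at the top level and "version_type" belongs under "dest". The index names are the ones from the example elsewhere in this thread.

```python
import json


def copy_if_newer_body(source_index, dest_index):
    """Build a POST /_reindex body that keeps going on version conflicts."""
    return {
        # Count version conflicts instead of aborting the whole reindex.
        "conflicts": "proceed",
        "source": {"index": source_index},
        # Carry the source document's version over to the target.
        "dest": {"index": dest_index, "version_type": "external"},
    }


if __name__ == "__main__":
    print("POST /_reindex")
    print(json.dumps(copy_if_newer_body("source", "dest"), indent=2))
```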

It ought to be the same. You might have to set the size parameter higher to get better throughput. When it was first released reindex used the default search size as its batch size (10) which is very very small. You can change it like this:

POST /_reindex
{
  "source": {
    "index": "source",
    "size": 1500
  },
  "dest": {
    "index": "dest"
  }
}

That'll improve performance. You can also run multiple reindexes in parallel - use a date range or something to slice them as you like.
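One way to slice by date range, sketched in Python: build one reindex body per date window and submit each as its own request. The field name "created_at", the dates, and the window length are all assumptions for illustration. Documents without that field would not be matched by any slice, so this only works if every document has it.

```python
import json
from datetime import date, timedelta


def date_sliced_reindex_bodies(source_index, dest_index, field,
                               start, end, days_per_slice, batch_size=1500):
    """Build one POST /_reindex body per window covering [start, end)."""
    if days_per_slice < 1:
        raise ValueError("days_per_slice must be at least 1")
    bodies = []
    lower = start
    while lower < end:
        upper = min(lower + timedelta(days=days_per_slice), end)
        bodies.append({
            "source": {
                "index": source_index,
                "size": batch_size,
                # Half-open range, so neighbouring slices never overlap.
                "query": {"range": {field: {"gte": lower.isoformat(),
                                            "lt": upper.isoformat()}}},
            },
            "dest": {"index": dest_index},
        })
        lower = upper
    return bodies


if __name__ == "__main__":
    # "created_at" and the dates below are hypothetical.
    for body in date_sliced_reindex_bodies("source", "dest", "created_at",
                                           date(2016, 1, 1), date(2016, 1, 31), 10):
        print("POST /_reindex")
        print(json.dumps(body, indent=2))
```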

It sorts by _doc by default. That just gets them in whatever order Lucene thinks is fastest. Mostly.

You should make the destination before starting the reindex. I mean, you can use templates if you like, but personally I'd manually make the index before the reindex.

Thanks for the helpful response. I'll make sure not to mention the cluster ID in this forum. I've only been posting in the Elastic Cloud forum. From my testing locally, it seems faster. I reindexed 1 million documents in about 43 seconds. My box is beefier than what is on Heroku, but I'm running a bunch of other crap on my box, so I'm hoping it's fast on Heroku. I'll report back my findings.