Improving performance using Reindex API

Hey All,

ES version: 2.3.2

I have a couple of questions regarding the Reindex API:

  1. I'm reindexing data using the Reindex API. Every time I copy data from one index to another, the new index ends up almost twice the size of the old one. Why is that?

  2. What should the value of the size parameter be to speed up the reindexing process?
    I'm using the Reindex API in production with the following steps:

Step 1: Run a first pass:

POST /_reindex
{
   "source": {
      "index": "prod-index-1",
      "size": 3000
   },
   "dest": {
      "index": "prod-index-0"
   }
}

Step 2: Block writes on "prod-index-1" (the old index)
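For reference, blocking writes can be done with an index settings update (a sketch, using the index.blocks.write setting):

PUT /prod-index-1/_settings
{
   "index.blocks.write": true
}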

Step 3: Run a second pass of reindexing:

POST /_reindex
{
   "conflicts" : "proceed",
   "source": {
      "index": "prod-index-1",
      "size": 3000
   },
   "dest": {
      "index": "prod-index-0",
      "version_type": "external"
   }
}

Step 4: Check the document counts on both indices; if they match, increase replicas to 1 and set the refresh interval back to 1s.
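The count check and the settings restore in this step look roughly like this (a sketch; it assumes replicas and the refresh interval were turned down before the first pass):

GET /prod-index-1/_count
GET /prod-index-0/_count

PUT /prod-index-0/_settings
{
   "index": {
      "number_of_replicas": 1,
      "refresh_interval": "1s"
   }
}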

Please let me know whether the above is an efficient way to use the Reindex API, or whether there are other parameters I can use to tune performance.
The second pass takes almost as long as the first pass, so is there any way to speed it up?

Thanks,
Ayush Sangani

Is it writing as much data? I expect not, but what is the count it returns?

I do expect the second pass to take a while though - it is still performing the same query.

You might be able to avoid doing as much work on the second pass if your data has a nice, incrementing field like time.

Nope, it doesn't write the whole dataset, just updates the documents that were modified, which is good.

I believe it's still scanning the whole dataset on the second pass, which is why it takes so long.
Not sure what you mean by an "incrementing field"?
We have the _timestamp field in place, which Elasticsearch uses by default, and it should consider something like that when we use version_type: external.

Cool. I suspect the second pass is all query time then. At this point the
only thing to do is to try to break up the query in such a way that it doesn't
select unmodified documents. Like if you had a modified time or something.
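For illustration, if the documents had a modified-time field (here a hypothetical last_modified field, assuming the first pass started no more than an hour ago), the second pass could filter the source down to recently changed documents:

POST /_reindex
{
   "conflicts": "proceed",
   "source": {
      "index": "prod-index-1",
      "size": 3000,
      "query": {
         "range": {
            "last_modified": {
               "gte": "now-1h"
            }
         }
      }
   },
   "dest": {
      "index": "prod-index-0",
      "version_type": "external"
   }
}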

I've talked to a few ES developers about doing fancier things,
specifically capturing changes and forwarding them along too, but the
infrastructure for that is still a long way off.

Yeah, I believe in the second pass it also has to look for documents which have been deleted, to update the new index.
It's still better than what we had earlier; I'm glad that ES has a native Reindex API in place, which does pretty much the same thing as our custom Spark job.
Let me know if there are any open issues I can keep an eye on.

Also, it would be really helpful if you could tell me what a safe value of size would be, so that it doesn't blow up the cluster. I'm going to try 5000 for the next reindex request.

@nik9000 Do you know why the Reindex API doesn't carry over deletes in the second pass?

For example: if during the first pass a document is deleted from the source index after it has already been copied, it stays indexed in the destination index. So during the second pass, shouldn't the Reindex API delete that document from the destination index?

Let me know if you need any other information.

That just isn't a thing reindex does. It concentrates on doing the right thing during the copy process but doesn't try to remember when a document was deleted. Tracking deletes is fairly complex and would require some changes to Elasticsearch that haven't been completed yet (sequence numbers on operations, I think).

Even then reindex probably won't do this - it'll stay doing what it does. We'll likely build some other replication solution that understands sequence numbers and is actually designed for replication. Reindex can be used for replication, but it really isn't very efficient at it.