Improving performance using Reindex API

Hey All,

ES version: 2.3.2

I have a couple of questions regarding the Reindex API:

  1. I'm reindexing data using the Reindex API. Every time I copy data from one index to another, the new index ends up almost twice the size of the old one. Why is that?

  2. What should the value of the size parameter be to speed up the reindexing process?
    I'm using the Reindex API in production with the following steps:

Step 1: Run a first pass:

POST /_reindex
{
   "source": {
      "index": "prod-index-1",
      "size": 3000
   },
   "dest": {
      "index": "prod-index-0"
   }
}

Step 2: Block writes on "prod-index-1" (the old index)
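For reference, blocking writes can be done with an index settings update (a sketch, using the index.blocks.write setting):

PUT /prod-index-1/_settings
{
   "index.blocks.write": true
}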

Step 3: Run a second pass of reindexing:

POST /_reindex
{
   "conflicts" : "proceed",
   "source": {
      "index": "prod-index-1",
      "size": 3000
   },
   "dest": {
      "index": "prod-index-0",
      "version_type": "external"
   }
}

Step 4: Check the document counts on both indices; if they match, increase replicas to 1 and set the refresh interval back to 1s.
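The count check and the settings restore in this step look roughly like this (a sketch; it assumes replicas and the refresh interval were turned down before the first pass):

GET /prod-index-1/_count
GET /prod-index-0/_count

PUT /prod-index-0/_settings
{
   "index": {
      "number_of_replicas": 1,
      "refresh_interval": "1s"
   }
}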

Please let me know whether the above is an efficient way to use the Reindex API, or whether there are other parameters I can use to tune performance.
The second pass takes almost as long as the first pass, so is there any way to speed it up?

Thanks,
Ayush Sangani

Is it writing as much data? I expect not, but what is the count it returns?

I do expect the second pass to take a while though - it is still performing the same query.

You might be able to avoid doing as much work on the second pass if your data has a nice, incrementing field like time.

Nope, it doesn't write the whole dataset, just updates the documents that were modified, which is good.

I believe it's still scanning the whole dataset on the second pass, which is why it takes so long.
Not sure what you mean by an "incrementing field"?
We have the _timestamp field in place, which Elasticsearch uses by default, and it should consider something like that when we use version_type: external.

Cool. I suspect the second pass is all query time then. At this point the
only thing to do is to try to break up the query in such a way that it doesn't
select unmodified documents. Like if you had a modified time or something.
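For illustration, if the documents had a modified-time field (here a hypothetical last_modified field, assuming the first pass started no more than an hour ago), the second pass could filter the source down to recently changed documents:

POST /_reindex
{
   "conflicts": "proceed",
   "source": {
      "index": "prod-index-1",
      "size": 3000,
      "query": {
         "range": {
            "last_modified": {
               "gte": "now-1h"
            }
         }
      }
   },
   "dest": {
      "index": "prod-index-0",
      "version_type": "external"
   }
}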

I've talked to a few ES developers about doing fancier things,
specifically capturing changes and forwarding them along too, but the
infrastructure for that is still a long way off.

Yeah, I believe in the second pass it also has to look for documents which have been deleted, to update the new index.
It's still better than what we had earlier; I'm glad that ES has a native Reindex API in place, which does pretty much the same thing as our custom Spark job.
Let me know if there are any open issues I can keep an eye on.

Also, it would be really helpful if you could tell me what a safe value of size would be, so that it doesn't blow up the cluster. I'm going to try 5000 for the next reindex request.

@nik9000 Do you know why the Reindex API doesn't carry over deletes in the second pass?

For example: if during the first pass a document is deleted from the source index after it has already been copied, it stays indexed in the destination index. So during the second pass, shouldn't the Reindex API delete that document from the destination index?

Let me know if you need any other information.

That just isn't a thing reindex does. It concentrates on doing the right thing during the copy process but doesn't try to remember when a document was deleted. Tracking deletes is fairly complex and would require some changes to Elasticsearch that haven't been completed yet (sequence numbers on operations, I think).

Even then reindex probably won't do this - it'll stay doing what it does. We'll likely build some other replication solution that understands sequence numbers and is actually designed for replication. Reindex can be used for replication, but it really isn't very efficient at it.