Re-indexing a live index


I have a live index currently running with about 40M documents. I want to migrate from ES 1.X to ES 5.X with minimum downtime.

My problem is that the documents can be updated at any time during the re-index. ES does not have a built-in update timestamp that I can scan and scroll against, so I am wondering what the best way is to re-index with minimal downtime.

When upgrading across two major releases, it is important to read the breaking changes from 1.x to 2.x, as well as from 2.x to 5.x.

Elasticsearch is able to use indices created in the previous major version only. For instance, Elasticsearch 5.x can use indices created in Elasticsearch 2.x, but not those created in Elasticsearch 1.x or before.

If you are running an Elasticsearch 1.x cluster, you have two options:

How are you updating the documents? Indexing full new versions of the documents or using scripted updates? Are you using any parent-child relationships?

Thanks for the explanation but that is not my problem.

I am looking for best practices to re-index a live constantly updated index with minimum downtime. I do not think it matters if I re-index from 2.4 to 5 or directly from 1.7 to 5. My problem is that the documents might be updated during the re-index process.

That's the problem: I use all of the above. I use scripted and partial updates, and I do have parent-child relations, though the children do not get updated.

I am thinking of updating the mapping on my ES 1.7 cluster to add a new timestamp field, and changing all my services' code (both scripted and partial updates) to set that field as well. Then when re-indexing I can use that field to scan and scroll.
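For what it's worth, here is a rough sketch of what "update that field as well" could look like if every update body passes through one helper. It assumes the ES 1.x update body shape (a `doc` for partial updates, a string `script` plus `params` for scripted ones). The field name `last_modified` and the helper `stamp_update` are made up for illustration and are not from this thread.

```python
import copy
import time

TIMESTAMP_FIELD = "last_modified"  # hypothetical field name


def stamp_update(body, now_ms=None):
    """Return a copy of an update request body that also sets TIMESTAMP_FIELD."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    stamped = copy.deepcopy(body)
    if "doc" in stamped:
        # Partial update: the timestamp is merged like any other field.
        stamped["doc"][TIMESTAMP_FIELD] = now_ms
    if "script" in stamped:
        # Scripted update: append an assignment and pass the value as a param.
        # Assumes the script is a plain string, as in the 1.x update API.
        script = stamped["script"].rstrip("; ")
        stamped["script"] = "%s; ctx._source.%s = %s" % (
            script, TIMESTAMP_FIELD, TIMESTAMP_FIELD)
        stamped.setdefault("params", {})[TIMESTAMP_FIELD] = now_ms
    if "upsert" in stamped:
        # Documents created through an upsert should carry the stamp too.
        stamped["upsert"][TIMESTAMP_FIELD] = now_ms
    return stamped
```

Full re-indexes of a document would need the same field set in the document source itself; this helper only covers the update API bodies.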

Not sure if that is the best approach.

That does indeed make it more complicated. If you could alter your application to add a last-modified timestamp to all records that are updated or inserted before you start migrating, you could then start with a scan and scroll of all records that are missing this timestamp, which would presumably get most of the data migrated. Once this has completed, you can perform further scan and scroll passes based on these timestamps to catch up with the latest changes. At some point you will probably need some downtime or have to start dual feeding, though.
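The passes described above could be sketched roughly like this. `pass_query` builds a search body in the ES 1.x query DSL (`filtered` query with a `missing` or `range` filter), and `migrate_pass` is only an in-memory stand-in for a real scan and scroll so the selection logic can be tried out. The field name `last_modified`, its mapping as a date holding epoch milliseconds, and both function names are assumptions.

```python
TIMESTAMP_FIELD = "last_modified"  # hypothetical field name


def pass_query(since_ms=None):
    """Build an ES 1.x search body for one migration pass."""
    if since_ms is None:
        # First pass: documents that were never stamped.
        flt = {"missing": {"field": TIMESTAMP_FIELD}}
    else:
        # Catch-up pass: documents stamped at or after since_ms.
        flt = {"range": {TIMESTAMP_FIELD: {"gte": since_ms}}}
    return {"query": {"filtered": {"filter": flt}}}


def migrate_pass(source, target, since_ms=None):
    """In-memory stand-in for one scan-and-scroll pass.

    source and target are plain dicts of id -> document.
    Returns the number of documents copied.
    """
    copied = 0
    for doc_id, doc in source.items():
        stamp = doc.get(TIMESTAMP_FIELD)
        if since_ms is None:
            selected = stamp is None
        else:
            selected = stamp is not None and stamp >= since_ms
        if selected:
            target[doc_id] = dict(doc)
            copied += 1
    return copied
```

The first catch-up pass would use `since_ms=0` (or the time the stamping code went live), because stamped documents were skipped by the first pass. Each later pass would use the start time of the pass before it, so the window to cover keeps shrinking until a short stop of writes or dual feeding can close it.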

That's what I was thinking. I will keep this post open in case someone comes up with a better idea.

You might be able to dual feed, but as you have parent-child relationships, that would require hierarchies to be migrated on demand, which is probably more complicated.

What do you mean by "that would require hierarchies to be migrated on demand"? As I said, the children do not get updated once inserted, and the IDs of the parent and child do not change.

If you are creating or updating a child and the parent has not yet been migrated, you may need to migrate the parent at the same time as the child.
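To make that concrete, here is a small sketch of the check a dual-feeding writer might do. The two indices are modelled as plain dicts with `parents` and `children` maps, so this shows only the ordering of steps, not real client calls. The function name `dual_feed_child` and the data layout are invented for illustration.

```python
def dual_feed_child(old_index, new_index, child_id, child, parent_id):
    """Write a child to both indices, migrating its parent first if needed."""
    if parent_id not in new_index["parents"]:
        parent = old_index["parents"].get(parent_id)
        if parent is None:
            raise KeyError("parent %r not found in the old index" % (parent_id,))
        # Migrate the parent on demand so the child is never orphaned.
        new_index["parents"][parent_id] = dict(parent)
    record = dict(child, _parent=parent_id)
    old_index["children"][child_id] = dict(record)
    new_index["children"][child_id] = dict(record)


old = {"parents": {"p1": {"name": "parent one"}}, "children": {}}
new = {"parents": {}, "children": {}}
dual_feed_child(old, new, "c1", {"note": "hello"}, "p1")
```

After the call, `new` holds both `p1` and `c1` even though the bulk migration had not reached `p1` yet. A later catch-up pass that copies `p1` again would simply overwrite it with the current version.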

Oh, yes. Thanks
