Fwd: Handling live updates while reindexing data

Hi,

I have gone through the post at
https://www.elastic.co/blog/changing-mapping-with-zero-downtime on
reindexing with zero downtime.

It has no information on handling live updates that keep arriving at the old index.

The solutions I have thought of are:

  1. Queue updates and deletes, and apply these instructions to the new index
    when reindexing is done.
    (The issue I can see with this: since it does not update the old index, my
    current search queries will not be up to date.)

  2. Keep performing live updates on the old index and keep queueing them as
    well. When I am done with the reindex, replay the queued commands against
    the new index.
    (The issue with this: there can be data inconsistencies.)

  3. I can't use the old index as the source for the new index, as my old
    documents do not contain some new fields. I will always need to reindex
    from the source of truth (SQL). And since this SQL DB is being updated at
    a high rate, how can I reindex into the new Elasticsearch index?

It would be really helpful if I could get some pointers.

Thanks in advance.
Prannoy Mittal.

In the past I've implemented both option 2 and option 1. It honestly depends on what your users expect and can handle. In my case a little bit of going back in time wasn't a big deal as long as it was corrected within a few seconds, and that is usually how long it took.
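Option 2 can be sketched roughly like this. It is only an in-memory illustration, not real client code: the two indices are plain dicts, `DualWriter` and its methods are hypothetical names, and every queued operation is assumed to be a full-document overwrite or a delete (so replaying it is idempotent).

```python
from collections import deque


class DualWriter:
    """Hypothetical helper: apply live ops to the old index and queue them for the new one."""

    def __init__(self, old_index, new_index):
        self.old_index = old_index  # dict standing in for the old index
        self.new_index = new_index  # dict standing in for the new index
        self.pending = deque()      # ops received while the reindex is running

    @staticmethod
    def apply(index, op, doc_id, doc=None):
        if op == "delete":
            index.pop(doc_id, None)
        else:  # "index": full-document overwrite
            index[doc_id] = doc

    def write(self, op, doc_id, doc=None):
        # Searches still hit the old index, so keep it current...
        self.apply(self.old_index, op, doc_id, doc)
        # ...and remember the op so the new index can catch up later.
        self.pending.append((op, doc_id, doc))

    def replay(self):
        # Call once the bulk reindex has finished; ops are replayed in arrival order.
        while self.pending:
            op, doc_id, doc = self.pending.popleft()
            self.apply(self.new_index, op, doc_id, doc)


old = {"1": {"title": "a"}, "2": {"title": "b"}}
new = {}
writer = DualWriter(old, new)

snapshot = dict(old)                         # the reindex reads a point-in-time view
writer.write("index", "1", {"title": "a2"})  # live update during the reindex
writer.write("delete", "2")                  # live delete during the reindex
new.update(snapshot)                         # bulk reindex finishes with stale data
writer.replay()                              # queued ops bring the new index up to date
```

Between the end of the bulk copy and the end of `replay()` the new index is briefly stale, which is the "going back in time" window mentioned above. Partial updates or scripted updates would need more care than this sketch gives them, since replaying them twice or out of order is not safe.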

If you have an external source of truth you might want to have a look at Elasticsearch's version_type=external semantics. If you set up whatever system is syncing the source of truth into Elasticsearch to write to both the new and the old index, and to always send the version along with version_type=external, then Elasticsearch will reject any update that would downgrade a document. You'd still have to handle deletes yourself, because reindex works from a snapshot of the data at a point in time and won't notice deletes made to the source index afterwards.
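To make the version_type=external behaviour concrete, here is a small in-memory model of it. It does not talk to Elasticsearch: `ExternalVersionedIndex`, `VersionConflict` and `sync` are hypothetical names, and the version numbers are assumed to come from the source of truth (for example a per-row counter). The one rule it models is that a write is accepted only if its version is strictly greater than the stored one.

```python
class VersionConflict(Exception):
    """Stands in for the version conflict Elasticsearch reports (HTTP 409)."""


class ExternalVersionedIndex:
    """Toy model of an index written to with version_type=external."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version):
        current = self.docs.get(doc_id)
        # External versioning: accept only a strictly newer version.
        if current is not None and version <= current[0]:
            raise VersionConflict(f"{doc_id}: {version} <= {current[0]}")
        self.docs[doc_id] = (version, source)


def sync(index, doc_id, source, version):
    """Write a document and treat a version conflict as 'already up to date'."""
    try:
        index.index(doc_id, source, version)
        return True
    except VersionConflict:
        return False


new_index = ExternalVersionedIndex()

# A live update reaches the new index first, carrying version 7 from the source.
live_applied = sync(new_index, "42", {"title": "fresh"}, version=7)

# The reindex job then arrives with the older copy it read earlier (version 5).
stale_applied = sync(new_index, "42", {"title": "stale"}, version=5)
```

Here `live_applied` is `True` and `stale_applied` is `False`, so the document keeps the fresh content no matter which writer gets there first. With a real client the same effect comes from passing `version` and `version_type=external` on each index request and treating the resulting conflict response as harmless.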

If the external source of truth is fast enough then you can just rebuild the whole index from it instead. In my case it was several orders of magnitude faster and less resource intensive to rebuild from Elasticsearch itself, but if you can get away with refreshing from the source of truth then it is probably worth it.

Thanks @nik9000. Using the external version type is really cool, but in my case the data fed into ES is combined from multiple tables in relational DBs into a single nested object. Using the last-updated time of just one table (the least recently updated one) as the version can lead to data inconsistencies when there are simultaneous partial updates to the ES document.