Initial data load / full reindexing from MySQL to Elasticsearch

Hello,

I am trying to design an initial data load and full reindexing (in the event of document schema change) solution for my Elasticsearch service. Source of truth for all data is MySQL, search related data will be stored in ES. I am planning to listen to MySQL bin logs and use a CDC (Change data capture) tool to index any document that gets updated into ES. For the initial load (or full reindex), I am planning to turn on the CDC tool and then use bulk index API to bulk load any previous data into ES. Would use index aliases to seamlessly swap indices.

  1. Is this a standard/recommended approach?
  2. Is there a conditional bulk index option available in ES so that if a document already exists in ES, the bulk index only replaces the document if the last modified time on it is after the existing one?

Thanks!

Hey,

I cannot really talk about the first question, but answer the second. If you bulk index a document with the same ID as you indexed before, that document gets replaced, so you only have the one available that was modified the last.

Hope that answer the part of your question.

--Alex

@spinscale Thanks for answering my question. So in my case it is not guaranteed that the document being indexed later is the most recent copy. Wondering if there is a way to conditionally replace the document?

One way I can think of is read the document, if it exists then check the condition, if condition satisfied then index the document else not. But that seems inefficient.

I think we're not on the same page here. Let me reinforce this. If you index a document with an id 1 and then reindex another document with the id 1, then the first document will be gone and only the second one exists. So indeed, the guarantee is given.

Maybe you should take a look at the update API and using versioning when there are concurrent writes. See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/optimistic-concurrency-control.html and https://www.elastic.co/guide/en/elasticsearch/reference/7.3/docs-update.html

Thanks @spinscale. Not sure if "Optimistic concurrency control" can help here, but it seems possible to search for a document (by id) and if it does not exist, then index it? Though bulk API doesn't seem viable for this case.

Another option I see, using upsert API with script option where we SKIP upsert if modified time on document is after the modified time that we currently see on the Mysql DB. From the docs it seems like I can use bulk API for upserts. What do you think about this option?

Thanks,
Priya

That is possible using put if absent by specifying the op_type, this also works using the bulk API.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.