Initial data load / full reindexing from MySQL to Elasticsearch

Priya_Gupta · September 24, 2019, 11:42pm

Hello,

I am trying to design an initial data load and full reindexing (in the event of document schema change) solution for my Elasticsearch service. Source of truth for all data is MySQL, search related data will be stored in ES. I am planning to listen to MySQL bin logs and use a CDC (Change data capture) tool to index any document that gets updated into ES. For the initial load (or full reindex), I am planning to turn on the CDC tool and then use bulk index API to bulk load any previous data into ES. Would use index aliases to seamlessly swap indices.

Is this a standard/recommended approach?
Is there a conditional bulk index option available in ES so that if a document already exists in ES, the bulk index only replaces the document if the last modified time on it is after the existing one?

Thanks!

spinscale · September 25, 2019, 12:29pm

Hey,

I cannot really talk about the first question, but answer the second. If you bulk index a document with the same ID as you indexed before, that document gets replaced, so you only have the one available that was modified the last.

Hope that answer the part of your question.

--Alex

Priya_Gupta · September 25, 2019, 4:20pm

@spinscale Thanks for answering my question. So in my case it is not guaranteed that the document being indexed later is the most recent copy. Wondering if there is a way to conditionally replace the document?

One way I can think of is read the document, if it exists then check the condition, if condition satisfied then index the document else not. But that seems inefficient.

spinscale · September 26, 2019, 1:53pm

I think we're not on the same page here. Let me reinforce this. If you index a document with an id 1 and then reindex another document with the id 1, then the first document will be gone and only the second one exists. So indeed, the guarantee is given.

Maybe you should take a look at the update API and using versioning when there are concurrent writes. See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/optimistic-concurrency-control.html and https://www.elastic.co/guide/en/elasticsearch/reference/7.3/docs-update.html

Priya_Gupta · October 3, 2019, 10:48pm

Thanks @spinscale. Not sure if "Optimistic concurrency control" can help here, but it seems possible to search for a document (by id) and if it does not exist, then index it? Though bulk API doesn't seem viable for this case.

Another option I see, using upsert API with script option where we SKIP upsert if modified time on document is after the modified time that we currently see on the Mysql DB. From the docs it seems like I can use bulk API for upserts. What do you think about this option?

Thanks,
Priya

spinscale · October 7, 2019, 12:20pm

That is possible using put if absent by specifying the op_type, this also works using the bulk API.

system · November 4, 2019, 12:20pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elastic search BulkAsync after Reindex is overwriting the documents Elasticsearch	1	349	July 3, 2019
Handling updates from multiple sources Elasticsearch	6	2887	July 6, 2017
Reindexing Strategy Elasticsearch	3	1269	July 6, 2017
Update existing record in elasticsearch Elasticsearch	3	4498	June 16, 2017
Replace the indexed data with new data Logstash	11	13713	July 6, 2017

Initial data load / full reindexing from MySQL to Elasticsearch

Related topics