Detect difference between elastic search and SQL databse

I am building Elasticsearch solution. Data for indexing will come from Message queue and service will index that data to Elasticsearch.

Applications -> SQL DB
Applications -> Message Queue -> Indexer Service -> Elastic search

Application will insert data to database and enqueue message that data is inserted. Indexer service will listen to queue and index data to Elasticsearch. Expected size of Elasticsearch database is 150 GB that represents about 400 millions records in SQL database.

I am afraid that at it can happen that some of our data is corrupted, that means that change happened in sql database but wasn't indexed in Elasticsearch and we want know about it until someone detects a bug. I think it would be good idea to have ability to periodically(once a day) check if there is any corrupted data.

My idea is to run service that will read batch of entities from SQL database and same documents form Elasticsearch, than for each of those calculate hash and compare it. This process will be taking a while and it will consume a lot of network bandwith so I would like to optimize it or to find some other solution.

Does anybody else had experience with something like this or has ideas how to improve this solution?

Thank you!

A few years ago, in JDBC river, I used checksums (CRC) to detect modified data. While this works "somehow" you should be aware this imposes some restrictions on the data set behavior:

  • is your data set allow to grow, or to be modified, or to shrink? All three cases must be recognized followed by an appropriate action.

  • your SQL database has a different organization (relational data model). This means, a select operation creates rows that have "ephemeral" identity, they can not be identified later (unless you rely on primary keys only). If you repeat the operation, you may get different order of rows (if no 'order by' clause is given) or count of rows. So what builds your documents must be specified somewhere aside of the SQL statement. This is the relational/object impedance problem you have to solve in your application.

  • assuming you don't use a simple 1:1 mapping where one row builds one doc, if you merge several rows into one Elasticsearch document, you still have to detect what a modified row set in the database means to the Elasticsearch document - will it shrink, expand, or even get deleted?

1 Like