I am building an Elasticsearch solution. The data to index will come from a message queue, and a service will index that data into Elasticsearch.
Applications -> SQL DB
Applications -> Message Queue -> Indexer Service -> Elasticsearch
The application will insert data into the database and enqueue a message saying that the data was inserted. The indexer service will listen to the queue and index the data into Elasticsearch. The expected size of the Elasticsearch database is 150 GB, which represents about 400 million records in the SQL database.
I am afraid that some of our data may become corrupted, meaning that a change happened in the SQL database but was not indexed in Elasticsearch, and we want to know about it before someone reports it as a bug. I think it would be a good idea to be able to check periodically (once a day) whether there is any corrupted data.
My idea is to run a service that reads a batch of entities from the SQL database and the same documents from Elasticsearch, then calculates a hash for each of them and compares the hashes. This process will take a while and will consume a lot of network bandwidth, so I would like to optimize it or find some other solution.
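To make the idea concrete, here is a rough sketch of the per-batch comparison I have in mind. It is only an illustration: the names record_hash and find_mismatches are hypothetical, and it assumes each batch has already been loaded from both sides into a dict keyed by entity id, with field values that serialize the same way in SQL and in Elasticsearch.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a record's fields in a canonical (key-sorted) JSON form."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_mismatches(sql_rows: dict, es_docs: dict) -> dict:
    """Compare one batch; both arguments map entity id -> dict of fields.

    Loading the batch from SQL and Elasticsearch is assumed to happen elsewhere.
    """
    # In SQL but never indexed.
    missing = [i for i in sql_rows if i not in es_docs]
    # Present on both sides but with different content.
    stale = [
        i for i in sql_rows
        if i in es_docs and record_hash(sql_rows[i]) != record_hash(es_docs[i])
    ]
    # In Elasticsearch but no longer in SQL (e.g. a delete that was not indexed).
    orphaned = [i for i in es_docs if i not in sql_rows]
    return {"missing": missing, "stale": stale, "orphaned": orphaned}


# Made-up sample batch: id 2 differs, id 3 was never indexed.
sql_batch = {
    1: {"name": "a", "price": 10},
    2: {"name": "b", "price": 20},
    3: {"name": "c", "price": 30},
}
es_batch = {
    1: {"price": 10, "name": "a"},
    2: {"name": "b", "price": 25},
}
print(find_mismatches(sql_batch, es_batch))
# {'missing': [3], 'stale': [2], 'orphaned': []}
```

The sketch still transfers every full record over the network, which is exactly the cost I would like to reduce.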
Has anybody had experience with something like this, or does anyone have ideas on how to improve this solution?
Thank you!