We have about 50 million data objects stored in a regular RDBMS, and we are trying to index these records as documents in ES.
So far so good: we have a job that does this, and it looks doable.
One question we have is: how can we verify that what we have in ES is the same as the transactional data in the RDBMS?
We get about as many updates to the data as searches, which is why we need to see whether ES is consistent with the RDBMS data store at any given time.
Is there any tool that can help verify the data in the indexes and compare it with the RDBMS?
How would you address this if a customer asked about it?
Use a transactional message queue system in between. It will make sure that messages are delivered to Elasticsearch.
Do auditing: every night, run a batch that counts the documents in the RDBMS and in Elasticsearch. If the counts differ, reindex what is missing.
Add a "try-catch" block around the Elasticsearch index call, and any time something goes wrong, just log it or send it to a dead letter queue (again, via the message queue system).
Better: combine all of that... That's what I've done in the past.
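To make the try-catch and audit ideas above concrete, here is a minimal Python sketch. It is not tied to any particular client library: `index_fn` is a hypothetical callable standing in for your real Elasticsearch index call, the dead letter queue is just a list, and the ID collections stand in for whatever your RDBMS and ES queries return.

```python
from typing import Any, Callable, Dict, Iterable, List


def index_with_dead_letter(
    docs: Iterable[Dict[str, Any]],
    index_fn: Callable[[Dict[str, Any]], None],  # hypothetical: wraps the real index call
    dead_letters: List[Dict[str, Any]],          # stand-in for a real dead letter queue
) -> int:
    """Index each document; on failure, park it in dead_letters instead of losing it."""
    indexed = 0
    for doc in docs:
        try:
            index_fn(doc)
            indexed += 1
        except Exception as exc:
            # Keep the document and the error so it can be retried later.
            dead_letters.append({"doc": doc, "error": str(exc)})
    return indexed


def find_missing_ids(db_ids: Iterable[int], es_ids: Iterable[int]) -> List[int]:
    """Audit step: IDs that exist in the RDBMS but not in Elasticsearch."""
    return sorted(set(db_ids) - set(es_ids))
```

In a nightly job you would probably compare the two counts first and only compute the ID difference when they disagree, then feed the missing IDs (and anything in the dead letter queue) back into the indexer.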
That seems like a good option, but how can we efficiently check what is missing?
Are you proposing to go through all primary keys in the RDBMS, look up the documents in ES with the same keys, and see what is missing?
That could take a long time, couldn't it?
Second, missed updates would still be a problem even when the document exists in ES. Are we saying we should trust the messaging log, i.e. that if a message was sent it must have been processed?
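One possible way to frame the key-by-key check, including missed updates, is a single sorted merge over both stores rather than one lookup per key. The sketch below is only an illustration under assumptions that are not stated in this thread: each row carries a version (or last-modified value) that is also stored in the ES document, and both sides can be read as `(id, version)` pairs sorted by id. The function name and the sample data are made up.

```python
from typing import Iterable, List, Tuple

Row = Tuple[int, int]  # (primary key, version) -- the version field is an assumption


def diff_sorted(
    db_rows: Iterable[Row], es_rows: Iterable[Row]
) -> Tuple[List[int], List[int], List[int]]:
    """Walk two id-sorted streams once and report missing, stale, and orphaned ids."""
    missing: List[int] = []   # in the RDBMS, not in ES
    stale: List[int] = []     # in both, but versions differ (a missed update)
    orphaned: List[int] = []  # in ES, no longer in the RDBMS
    db_iter, es_iter = iter(db_rows), iter(es_rows)
    db = next(db_iter, None)
    es = next(es_iter, None)
    while db is not None or es is not None:
        if es is None or (db is not None and db[0] < es[0]):
            missing.append(db[0])
            db = next(db_iter, None)
        elif db is None or es[0] < db[0]:
            orphaned.append(es[0])
            es = next(es_iter, None)
        else:
            if db[1] != es[1]:
                stale.append(db[0])
            db = next(db_iter, None)
            es = next(es_iter, None)
    return missing, stale, orphaned


if __name__ == "__main__":
    db = [(1, 1), (2, 3), (4, 1)]
    es = [(1, 1), (2, 2), (3, 1)]
    print(diff_sorted(db, es))  # ([4], [2], [3])
```

Because both streams are read once and in order, memory use stays flat, and the ids in `stale` are exactly the documents that exist in ES but missed an update. Whether this is fast enough for 50 million rows depends on how cheaply each side can produce the sorted stream.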