Hi everyone, I’m trying to achieve real-time operations without hurting performance too much.
More specifically, I need to sync data from a main relational database to Elasticsearch (ES). I’m considering a message broker, or alternatively an ETL job that scans the database every X minutes and replays the same operations on ES.
I ran into the issue of ES being near real-time rather than real-time.
In a scenario with just clients and ES indices (without a sync to an RDBMS), suppose a refresh occurs at t2:
- A new document is indexed at t0.
- The same document is deleted (or updated) at t1.
Since t1 falls before the refresh at t2, the delete/update targets a document that is not yet visible and ends up with a not-found result. This could be handled either by accepting the ES near-real-time behavior and letting clients receive the not-found result, or by forcing a refresh before performing the delete/update.
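To make the timing concrete, here is a toy model of that visibility gap (this is not the real ES API, just a sketch where writes sit in a buffer until an explicit refresh makes them searchable):

```python
# Toy model of ES near-real-time visibility (illustrative only, not the
# real ES API): indexed documents land in a pending buffer and become
# searchable only after refresh(), mimicking the refresh interval.
class ToyIndex:
    def __init__(self):
        self._pending = {}     # indexed but not yet refreshed (t0..t2)
        self._searchable = {}  # visible to search after a refresh

    def index(self, doc_id, doc):
        self._pending[doc_id] = doc

    def refresh(self):
        self._searchable.update(self._pending)
        self._pending.clear()

    def delete_by_search(self, doc_id):
        # Only sees refreshed (searchable) documents.
        if doc_id not in self._searchable:
            return "not_found"
        del self._searchable[doc_id]
        return "deleted"

idx = ToyIndex()
idx.index("42", {"title": "hello"})  # t0: document indexed
print(idx.delete_by_search("42"))    # t1: before refresh -> not_found
idx.refresh()                        # forced refresh (would be t2)
print(idx.delete_by_search("42"))    # after refresh -> deleted
```

The first delete fails exactly because it runs between t0 and t2; forcing the refresh first makes it succeed.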
Question 1: If inserts are far more frequent (and arrive back to back) than deletes and updates, is it better to refresh before deletes/updates?
Question 2: If accepting a not-found result is feasible with just clients <—> ES, what about clients <—> RDBMS <—> ES?
The document would be correctly deleted from the RDBMS (which is real-time), while the same two operations (index, then delete) would be enqueued on a message queue and executed back to back on ES at some unknown later time. Without an explicit refresh the delete would always fail, and the RDBMS and ES would be left inconsistent, because the document would remain indexed in ES. There would also be no way for clients to notice the failure, since the corresponding operations could have completed on the RDBMS long before.
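A minimal end-to-end sketch of that pipeline (all names here are illustrative stand-ins, not a real broker or ES client) shows how replaying the queue in order, with no refresh between the two operations, silently loses the delete:

```python
from collections import deque

rdbms = {}       # source of truth, real-time
queue = deque()  # operations to replay on ES
es_visible = {}  # searchable documents in the ES stand-in
es_pending = {}  # indexed but not yet refreshed

def replay(op):
    kind, doc_id, payload = op
    if kind == "index":
        es_pending[doc_id] = payload
    elif kind == "delete":
        # Like the near-real-time case: only sees refreshed documents.
        if doc_id in es_visible:
            del es_visible[doc_id]
        # else: the delete is silently lost -> inconsistency

# Client writes hit the RDBMS immediately and enqueue the same ops.
rdbms["42"] = {"title": "hello"}
queue.append(("index", "42", {"title": "hello"}))
del rdbms["42"]
queue.append(("delete", "42", None))

# Later, the consumer drains the queue back to back, with no refresh.
while queue:
    replay(queue.popleft())
es_visible.update(es_pending)  # the periodic refresh happens too late

print("42" in rdbms)       # False: gone from the source of truth
print("42" in es_visible)  # True: still indexed in ES
```

The document ends up deleted from the RDBMS but still present in ES, and nothing in the pipeline reports the failure back to the client.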
I’ve also read https://www.elastic.co/blog/found-keeping-elasticsearch-in-sync#using-queues-to-manage-batches but it doesn’t seem to address this issue.
The only way I’ve found is to trigger an explicit refresh before deleting or updating a document.
I hope I’ve explained my problem clearly enough.
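That workaround could be wrapped in a small helper. The helper name `refresh_then_delete` is hypothetical, but the two calls it assumes (`client.indices.refresh(...)` and `client.delete(...)`) are the ones the official elasticsearch-py client exposes; a recording stub stands in for a real cluster here just to show the call order:

```python
# Hypothetical helper for the "explicit refresh before delete" workaround.
def refresh_then_delete(client, index, doc_id):
    client.indices.refresh(index=index)  # make pending docs searchable first
    return client.delete(index=index, id=doc_id)

# Recording stub standing in for a real ES client (illustrative only).
class _StubIndices:
    def __init__(self, log):
        self._log = log
    def refresh(self, index):
        self._log.append(("refresh", index))

class StubClient:
    def __init__(self):
        self.calls = []
        self.indices = _StubIndices(self.calls)
    def delete(self, index, id):
        self.calls.append(("delete", index, id))
        return {"result": "deleted"}

es = StubClient()
refresh_then_delete(es, "articles", "42")
print(es.calls)  # the refresh is issued before the delete
```

Note that ES write requests also accept a `refresh` parameter (`true` or `wait_for`), so an alternative is to force or wait for the refresh on the preceding index operation instead of refreshing before every delete/update; the cost trade-off is essentially your Question 1.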