We've had an Elasticsearch setup for a while now that works by reingesting all the data from a relational database into a new index in denormalised form.
We now want changes made in the relational database to appear in Elasticsearch immediately, instead of waiting for the next reingestion to pick them up.
The current reingestion approach relies on a large number of aliases (which I understand is an antipattern). Each supplier-customer combination in our data has an alias, and when the data for a particular supplier-customer pair has been reingested we update its alias to point to the new index.
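For context, the repoint itself is done atomically via the _aliases API. Roughly this (a sketch using the Python client 8.x, with made-up index and alias names):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical names: one alias per supplier-customer combination,
# repointed from the previous index to the freshly reingested one.
old_index = "pricing-2024-06-01"
new_index = "pricing-2024-06-02"
alias = "supplier-123-customer-456"

# Both actions are applied in a single atomic request, so searches
# against the alias never see a gap or a mix of old and new data.
es.indices.update_aliases(
    actions=[
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]
)
```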
We're considering using external versioning to ensure eventual consistency for immediate updates. However, we can't figure out how to ensure that an immediate update isn't overwritten by reingestion. We're open to changing any part of the strategy, except that we really do want to reingest our data every day so that the state of the relational database doesn't drift away from the state of Elasticsearch.
The data we are putting into Elasticsearch is a denormalised view of the data in the database, with a bunch of business rules applied to it (exception pricing for certain customers, product groupings, etc.).
Currently we perform all that denormalisation once a day, but we do want to go more real-time: have the application write directly to both the database and Elasticsearch (or write to the database, then precompute pricing from there).
I'm considering using that, combined with external versioning. An update to a single entity in the database would trigger multiple documents to be indexed into Elasticsearch. If two of these updates happen at the same time for a given entity they can race each other, and the changes made by one update could be overwritten by the other. We're considering sending a timestamp in with each update, so an earlier update attempting to overwrite a later one throws a version conflict error; then we can retry the update with the latest data.
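As a rough sketch of what I mean (Python client 8.x; load_latest_docs is a made-up helper standing in for re-reading and re-denormalising the entity from the database):

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch("http://localhost:9200")


def load_latest_docs(entity_id: str) -> list[tuple[str, str, dict, int]]:
    """Hypothetical: re-read the entity from the relational database, re-apply
    the denormalisation and business rules, and return one tuple of
    (index, doc_id, doc, last_modified_epoch_millis) per derived document."""
    raise NotImplementedError


def push_entity(entity_id: str, max_attempts: int = 5) -> None:
    """Push all documents derived from one entity, using the source row's
    last-modified time (epoch millis) as an external version. A write carrying
    an older timestamp than what's already stored is rejected with a version
    conflict, so we re-read the latest state and try again."""
    for _ in range(max_attempts):
        docs = load_latest_docs(entity_id)
        try:
            for index, doc_id, doc, modified_at_ms in docs:
                es.index(
                    index=index,
                    id=doc_id,
                    document=doc,
                    version=modified_at_ms,
                    version_type="external",
                )
            return
        except ConflictError:
            # Another update won the race; the stored document is at least
            # as fresh, so loop around and retry with the latest data.
            continue
```

If the reingestion job sends the same last-modified timestamps as its versions, a reingest carrying stale rows hits the same conflict check instead of silently clobbering a newer immediate update.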
This all works fine, except we expect there will eventually be some drift away from the system of record, just because of new features, weird race conditions we haven't thought about, and so on. We would also like the ability to migrate to a new index when we want to update mappings, while minimising the amount of time updates are 'invisible' for. We're OK with waiting for a full reindex for documents to be deleted (for now).
I'm thinking about using incremental reindexing as described in the previous blog post, or writing to both indexes while we're reindexing. Reading from both indexes could in principle be handled with field collapsing on the _id field, but we do a lot of infinite scrolling via pagination, so unfortunately that solution doesn't work for us.
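For the dual-write option I'm imagining something like this (again a sketch with made-up index names): while the new index is being built, the live update path writes every document to both indexes with the same external version, so nothing written during the migration is lost when the alias flips.

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch("http://localhost:9200")

# Hypothetical names: the index currently behind the aliases, plus the new
# index being built with updated mappings (None when no migration is running).
LIVE_INDEX = "pricing-v1"
MIGRATION_INDEX = "pricing-v2"


def write_document(doc_id: str, doc: dict, modified_at_ms: int) -> None:
    targets = [LIVE_INDEX]
    if MIGRATION_INDEX:
        targets.append(MIGRATION_INDEX)
    for index in targets:
        try:
            es.index(
                index=index,
                id=doc_id,
                document=doc,
                version=modified_at_ms,
                version_type="external",
            )
        except ConflictError:
            # That index already holds a newer copy of this document.
            pass
```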
Happy to hear any thoughts or comments if the state of the art has moved on, or if there are other things I should investigate.
Thanks again!