We are looking to migrate some of our business use cases from SQL Server to Elasticsearch and have come across a requirement to support merge-update functionality, i.e. create the document if it doesn't exist, otherwise overwrite it with a new version.
We've accomplished this quite easily by constructing the document ID as a concatenation of several fields, but since document IDs are unique only within the scope of the index - and we have many weekly/monthly indices under the same alias - we cannot guarantee there won't be duplicates created in different indices.
For example, the first document written to the index on February 1st could be a duplicate of the last document written on January 31st - and when users query the alias that contains both indices they will see duplicate data.
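To make the problem concrete, here is a minimal sketch of the ID scheme and how the same ID ends up in two indices. The field names (`customer_id`, `order_id`, `line_no`), the `|` separator and the `orders-YYYY.MM` index naming are made up for illustration; our real fields and index names differ.

```python
from datetime import datetime

# Hypothetical key fields. The separator is an assumption and must not
# occur inside any of the field values, or two different keys could collide.
def make_doc_id(customer_id, order_id, line_no, sep="|"):
    return sep.join(str(part) for part in (customer_id, order_id, line_no))

# Hypothetical monthly index naming: <prefix>-YYYY.MM
def monthly_index(ts, prefix="orders"):
    return f"{prefix}-{ts:%Y.%m}"

def target(customer_id, order_id, line_no, ts):
    """Return the (index, _id) pair a write would be addressed to."""
    return monthly_index(ts), make_doc_id(customer_id, order_id, line_no)

# The same business key written a couple of minutes apart, across a month boundary.
jan = target("C42", "O1001", 1, datetime(2024, 1, 31, 23, 59))
feb = target("C42", "O1001", 1, datetime(2024, 2, 1, 0, 1))

print(jan)  # ('orders-2024.01', 'C42|O1001|1')
print(feb)  # ('orders-2024.02', 'C42|O1001|1')
```

Both writes carry the same `_id`, but because they are addressed to different indices the second one does not overwrite the first, so a search against the alias returns both.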
So far we've come up with a few partial solutions for this:
- Use a single index. This would ensure uniqueness, but the index would eventually grow out of control in terms of shard size, and maintenance would be extremely difficult.
- Make all writes to a single index (not time-based) and periodically move data older than X days to monthly history indices. This ensures uniqueness within the time frame of the main index (which is an acceptable compromise), but there doesn't seem to be any way of atomically moving the data out of it to the monthly indices (using reindex and delete by query): each time we move data out of the index, users will see either duplicate data or missing data.
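For reference, the move in the second option would look roughly like the two requests below. This is only a sketch that builds the request bodies for the `_reindex` and `_delete_by_query` endpoints; the index names, the `timestamp` field and the cutoff value are assumptions for illustration.

```python
import json

def build_move_requests(main_index, history_index, cutoff, ts_field="timestamp"):
    """Build the two requests that move documents older than `cutoff`.

    `cutoff` is a fixed absolute timestamp (not "now-Xd"), so that both
    requests select exactly the same set of documents.
    """
    query = {"range": {ts_field: {"lt": cutoff}}}
    reindex = (
        "POST",
        "/_reindex",
        {"source": {"index": main_index, "query": query},
         "dest": {"index": history_index}},
    )
    delete = (
        "POST",
        f"/{main_index}/_delete_by_query",
        {"query": query},
    )
    # These are two separate calls; nothing ties them into one atomic step.
    return [reindex, delete]

# Hypothetical index names and cutoff.
requests_to_send = build_move_requests(
    "orders-main", "orders-2024.01", "2024-02-01T00:00:00Z"
)
for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, indent=2))
```

Run in this order, the moved documents exist in both indices until the delete finishes (duplicates through the alias); run in the reverse order, they exist in neither until the reindex finishes (missing data).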
Is there a third option we might be missing?
Thanks,
Dan