Duplicate document ids when using rolling indices

Hi,

We run ES v7.17 and have started using index rollover with ILM (with a 40 GB size limit per index). We ran into the document duplication issue described in GitHub issue #61242 on elastic/elasticsearch, "ILM updating record with existing Id on older index creates a new entry on new write index".

We have documents that are frequently updated within a few days of creation; after that they are no longer modified.
The duplication happens when an index is rolled over: the first version of a document is written to, e.g., index-000001, and the updates arrive later, when the write index is already index-000002.
We set the document id ourselves so it's not generated by Elasticsearch.
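
To make the scenario concrete, here is a toy in-memory model in Python. It is only a sketch of the behaviour described above, not Elasticsearch code: the `RolloverAlias` class and its methods are hypothetical names, and it assumes every write goes through the alias to the current write index.

```python
class RolloverAlias:
    """Toy model of a write alias over rolled indices (not Elasticsearch code)."""

    def __init__(self, prefix="index"):
        self.prefix = prefix
        self.write_index = f"{prefix}-000001"
        # Each index maps document id -> document source.
        self.indices = {self.write_index: {}}

    def rollover(self):
        # Create the next numbered index and make it the write index.
        next_number = len(self.indices) + 1
        self.write_index = f"{self.prefix}-{next_number:06d}"
        self.indices[self.write_index] = {}

    def index(self, doc_id, source):
        # A write through the alias only touches the current write index,
        # so it cannot overwrite a copy that lives in an older index.
        self.indices[self.write_index][doc_id] = source

    def search_ids(self):
        # A search through the alias sees every backing index.
        return [
            (name, doc_id)
            for name, docs in self.indices.items()
            for doc_id in docs
        ]


alias = RolloverAlias()
alias.index("doc-1", {"status": "new"})      # lands in index-000001
alias.rollover()                             # write index is now index-000002
alias.index("doc-1", {"status": "updated"})  # same id, different index
hits = alias.search_ids()
print(hits)
```

The same id now appears once per backing index, which is the duplication we see in our search results.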

I understand from the bug report that the issue is known, but what would be a good way to work around it?

We have very large data sets that we query and the results lists are also very big (over 100k results).
Besides, our queries contain numerous terms aggregations (defined by the end user) for faceting purposes, and these queries can be quite frequent.

We can deduplicate the data when scrolling through it, but our application UI needs reliable stats for the item counts in the facets and for the total number of search results.
So far we were relying on the doc_count results returned by the terms aggregations, but with the duplicates included this doesn't work any more.
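
The effect on the facet counts can be sketched in plain Python. This is an illustration only: the `color` field and both helper functions are made up for the example, and it assumes that the copy in the highest-numbered index is the current one.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count hits per facet value, the way a terms aggregation's doc_count does."""
    return Counter(hit["_source"][field] for hit in hits)


def deduplicated_facet_counts(hits, field):
    """Keep one hit per _id (the one in the highest-numbered index), then count."""
    latest = {}
    for hit in hits:
        current = latest.get(hit["_id"])
        # Zero-padded index suffixes compare correctly as plain strings.
        if current is None or hit["_index"] > current["_index"]:
            latest[hit["_id"]] = hit
    return facet_counts(latest.values(), field)


hits = [
    {"_index": "index-000001", "_id": "doc-1", "_source": {"color": "red"}},
    {"_index": "index-000002", "_id": "doc-1", "_source": {"color": "blue"}},
    {"_index": "index-000002", "_id": "doc-2", "_source": {"color": "red"}},
]
naive = facet_counts(hits, "color")
deduplicated = deduplicated_facet_counts(hits, "color")
```

Here the naive count reports two red items although only one current document is red. Doing this client-side is fine while scrolling, but it does not scale to our aggregation queries.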

So I'm rather tempted to solve this by cleaning up the data set before querying. One solution I ran across is the Elastic Blog post "How to Find and Remove Duplicate Documents in Elasticsearch", although that post is written for autogenerated IDs.
If the cleanup procedure runs frequently and is restricted to a shorter time range, the number of deletes can be quite small.
What I'm a bit concerned about is that Elasticsearch may not remove the older documents right away, and I'm not sure whether a search would still return them before the actual deletion happens.
At the moment all rolled indices are "hot" so they're not read-only.
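
A minimal sketch of what such a cleanup step could look like in Python, under stated assumptions: the hits have already been fetched for a short time range, each write carries the full document (so the older copy can be dropped safely), and the copy in the highest-numbered index is the one to keep. `plan_duplicate_deletes` and `to_bulk_body` are hypothetical helpers; the second only builds the newline-delimited JSON body that the `_bulk` API accepts and does not send anything.

```python
import json
from collections import defaultdict


def plan_duplicate_deletes(hits):
    """Return (index, id) pairs for every copy except the newest one per id."""
    indices_by_id = defaultdict(list)
    for hit in hits:
        indices_by_id[hit["_id"]].append(hit["_index"])

    deletes = []
    for doc_id, indices in indices_by_id.items():
        if len(indices) < 2:
            continue  # no duplicate for this id
        # Assumption: zero-padded suffixes, so the string maximum is the newest index.
        newest = max(indices)
        deletes.extend((idx, doc_id) for idx in sorted(indices) if idx != newest)
    return deletes


def to_bulk_body(deletes):
    """Build an NDJSON body of delete actions for the _bulk API."""
    lines = [
        json.dumps({"delete": {"_index": idx, "_id": doc_id}})
        for idx, doc_id in deletes
    ]
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


hits = [
    {"_index": "index-000001", "_id": "doc-1"},
    {"_index": "index-000002", "_id": "doc-1"},
    {"_index": "index-000002", "_id": "doc-2"},
]
deletes = plan_duplicate_deletes(hits)
body = to_bulk_body(deletes)
```

Because the deletes target the concrete backing indices rather than the alias, this only works while the rolled indices are still writable.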

So, would someone have a better idea or a recommendation on how to solve this?
Thanks in advance!
