Same _id ends up duplicated across rollover indices behind a write alias — can this be prevented via template/ILM?

Hi all,

I’m indexing documents into Elasticsearch using a deterministic _id (SHA1 of email + normalized_context). I write to an alias that uses ILM rollover, so over time it creates backing indices like:

  • data-000073

  • data-000077

Problem: I’m seeing the same _id stored twice, but in different backing indices (e.g., one copy in data-000073 and another in data-000077). When I index again through the alias, the document is written to the current write index, so it doesn’t overwrite the older copy if it exists in a previous rolled-over index.

Questions:

  1. Is there any way (index template / ILM / alias setting) to enforce uniqueness of _id across all indices behind an alias (or a data stream), so that indexing via the alias overwrites the existing document even if it lives in an older backing index?

  2. If not possible, what’s the recommended approach to avoid disk growth from duplicates while still using rollover?
    (e.g., routing to fixed “bucket” indices based on hash prefix, periodic reindex+dedupe, or another pattern)
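
For illustration, the hash-prefix bucket idea could look roughly like the sketch below. It assumes 16 fixed indices named data-bucket-0 through data-bucket-f, which is a hypothetical naming scheme, and bucket_index is a hypothetical helper. Because the target index is derived from the _id itself, re-indexing the same _id would always land in the same index and overwrite the previous copy.

```python
import hashlib


def bucket_index(doc_id: str, prefix: str = "data-bucket", prefix_len: int = 1) -> str:
    """Map a hex _id to one of 16**prefix_len fixed indices (hypothetical naming)."""
    # The leading hex characters of the SHA-1 digest pick the bucket.
    return f"{prefix}-{doc_id[:prefix_len].lower()}"


# Example input only; the real hash input is email + normalized context.
doc_id = hashlib.sha1("user@example.comsome-context".encode()).hexdigest()
target_index = bucket_index(doc_id)
```

The trade-off is that the indices never roll over, so their number and size would have to be planned up front.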

Any pointers or best practices would be appreciated.

Example of _id generation:

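python

import hashlib

# Deterministic _id: SHA-1 of the email concatenated with its normalized context
hash_input = f"{email}{email_context_str}"
doc_id = hashlib.sha1(hash_input.encode()).hexdigest()

document = {
    "_index": INDEX_NAME,
    "_id": doc_id,
    "_source": {
        "email": email,
        # (remaining fields omitted)
    },
}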

Thanks!

No, this is not possible.

Why are you getting duplicates? Would it be possible to delete them at the source?

Are these exact complete duplicates? If they are, does each document have an event timestamp?

If you want to avoid duplicates and have both a unique ID and a consistent timestamp, you can use traditional time-based indices instead of rollover. With this approach each index covers a specific time period, e.g. a day or a month, and has that period as part of its name. Events are directed to the index that matches the event timestamp, so all events with the same ID and timestamp go to the same index, and a duplicate results in an update rather than a second copy. You can use this kind of index with ILM.
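
A minimal sketch of that routing, assuming monthly indices and a timezone-aware event timestamp on every document. The data prefix follows the index names in the question; index_for_event, build_action, and the @timestamp field are hypothetical names chosen for illustration. The timestamp has to be the event's own time and not the ingest time, otherwise a re-sent document could resolve to a different index.

```python
import hashlib
from datetime import datetime, timezone


def index_for_event(event_time: datetime, prefix: str = "data") -> str:
    """Return the monthly index name that covers this event timestamp."""
    # Convert to UTC so the same event always maps to the same month.
    # Assumes event_time is timezone-aware.
    utc_time = event_time.astimezone(timezone.utc)
    return f"{prefix}-{utc_time:%Y.%m}"


def build_action(email: str, email_context_str: str, event_time: datetime) -> dict:
    """Build an action shaped like the document in the question."""
    hash_input = f"{email}{email_context_str}"
    doc_id = hashlib.sha1(hash_input.encode()).hexdigest()
    return {
        "_index": index_for_event(event_time),  # concrete index, not the alias
        "_id": doc_id,
        "_source": {"email": email, "@timestamp": event_time.isoformat()},
    }
```

Two events with the same ID and the same event timestamp produce the same _index and _id, so the second write replaces the first.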
