Hi all,
I’m indexing documents into Elasticsearch using a deterministic _id (SHA1 of email + normalized_context). I write to an alias that uses ILM rollover, so over time it creates backing indices like:
- data-000073
- data-000077
Problem: I’m seeing the same _id stored twice, but in different backing indices (e.g., one copy in data-000073 and another in data-000077). When I index again through the alias, the document is written to the current write index, so it doesn’t overwrite the older copy if it exists in a previous rolled-over index.
Questions:

- Is there any way (index template / ILM / alias setting) to enforce uniqueness of `_id` across all indices behind an alias (or a data stream), so that indexing via the alias overwrites the existing document even if it lives in an older backing index?
- If that's not possible, what's the recommended approach to avoid disk growth from duplicates while still using rollover? (e.g., routing to fixed "bucket" indices based on a hash prefix, periodic reindex + dedupe, or another pattern)
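To illustrate the bucket-index idea from the second question, here is a minimal sketch of routing by hash prefix, so a given `_id` always resolves to the same fixed index and upserts land on the existing copy instead of the current rollover write index. The index-name pattern `data-bucket-NN` and the bucket count are hypothetical:

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical: number of fixed bucket indices


def bucket_for(email: str, email_context_str: str) -> tuple[str, str]:
    """Return (index_name, doc_id) for a document.

    The _id is the same deterministic SHA1 as in the question; the
    target index is derived from the first hex bytes of that hash,
    so re-indexing the same email/context pair always hits the same
    index and therefore overwrites the previous version.
    """
    doc_id = hashlib.sha1(f"{email}{email_context_str}".encode()).hexdigest()
    bucket = int(doc_id[:2], 16) % NUM_BUCKETS
    return f"data-bucket-{bucket:02d}", doc_id
```

The trade-off is that fixed buckets grow indefinitely instead of rolling over, so they suit dedupe-critical data better than time-series retention.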
Any pointers or best practices would be appreciated.
Example of `_id` generation:

```python
import hashlib

# email_context_str is the normalized context string built elsewhere
hash_input = f"{email}{email_context_str}"
doc_id = hashlib.sha1(hash_input.encode()).hexdigest()

document = {
    "_index": INDEX_NAME,
    "_id": doc_id,
    "_source": {
        "email": email,
        # ... (rest of _source omitted)
    },
}
```
Thanks!