I am running into a situation where the same _id appears multiple times within the same index pattern, i.e. uniqueness is broken across the multiple indices behind the pattern.
We are on version 7.17.0.
The data is ingested through the API using the Python requests library, like this:
POST /cnfm_test_history_norm_state/_bulk?filter_path=items.*.error
{"index":{"_id":"25510117"}}
{"field1":1,"field2":"2"}
...
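For context, a minimal sketch of how such a bulk body can be built and sent in Python. The index name and field values simply mirror the request above; the host and the helper function are assumptions for illustration:

```python
import json

def build_bulk_body(docs):
    """Build an NDJSON _bulk body: one action line plus one source line per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

body = build_bulk_body([("25510117", {"field1": 1, "field2": "2"})])

# Hypothetical ingestion call (host is an assumption):
# import requests
# requests.post(
#     "http://localhost:9200/cnfm_test_history_norm_state/_bulk",
#     params={"filter_path": "items.*.error"},
#     data=body,
#     headers={"Content-Type": "application/x-ndjson"},
# )
```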
A document ID is only unique within a single index (as long as you do not use routing). There is no way to enforce uniqueness across multiple indices, e.g. if you are using time-based indices and rollover. What you are seeing is therefore expected, as the documents are stored in two separate indices.
If you need the ID to be unique you have to either use a single index (which makes deletes and retention management more complex and expensive) or search for the document before updating it within your application (slow and tricky, especially with concurrent changes).
Yes, this is our case: we are using time-based indices with rollover.
The use case is a merge (join) of multiple streams based on _id.
We generate the same _id in both data streams and then join the data in Elasticsearch simply by addressing it via _id.
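For what it's worth, the join technique can be sketched roughly like this, assuming bulk `update` actions with `doc_as_upsert`, so that whichever stream arrives first creates the document and the other shallow-merges its fields in. The index name `joined_events` and the field names are hypothetical:

```python
import json

def merge_action(index, doc_id, partial_doc):
    """One bulk 'update' action that upserts and shallow-merges fields by _id."""
    action = json.dumps({"update": {"_index": index, "_id": doc_id}})
    payload = json.dumps({"doc": partial_doc, "doc_as_upsert": True})
    return action + "\n" + payload + "\n"

# Both streams write to the same _id. Note: the target must be a single
# concrete index -- with rollover the two halves can land in different
# backing indices, which is exactly the problem described in this thread.
body = (
    merge_action("joined_events", "25510117", {"field_from_stream_a": 1})
    + merge_action("joined_events", "25510117", {"field_from_stream_b": "2"})
)
```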
I did not expect this behavior, but I understand it. We will have to look for another way to solve our problem.
Just for clarification: the two data streams are quite big (up to 1 TB/day) and the delay (time shift) between the streams can be up to 2 hours.
Buffering that amount of data in memory is expensive, which is why we chose this technique, but it now looks like a bad approach.
@Christian_Dahlqvist thank you
You mentioned an approach based on delete/retention management.
I looked into it, and deletes don't seem to reduce the index size; they just mark documents as deleted.
To actually reclaim the space, _forcemerge has to be executed, but that should not be run against an index that is still being written to.
Do I understand this correctly, and is there any other method to handle this?
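For reference, the call I was considering looks roughly like this; `only_expunge_deletes` is a _forcemerge option that merges only segments containing deleted documents. The host and index name are placeholders:

```python
# Sketch only: reclaim space from deleted docs via _forcemerge.
# Host and index name are placeholders; run this against an index that is
# no longer being written to (e.g. one that has already rolled over).
host = "http://localhost:9200"
index = "my-index-000001"
url = f"{host}/{index}/_forcemerge?only_expunge_deletes=true"

# import requests
# requests.post(url)
```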
Correct. The deleted documents will be removed when the segments they are stored in are merged, which can take some time, but eventually they will be removed.
I would recommend letting Elasticsearch clean this up through normal merges, which will require more disk space in the meantime. I cannot think of any other way around this, so I would say this is the tradeoff you need to make.