I am running into a situation where the same _id appears multiple times within the same index pattern, i.e. uniqueness is broken across the multiple indices behind the pattern.
We are on version 7.17.0.
The data is ingested through the API using the Python requests library, like this:
POST /cnfm_test_history_norm_state/_bulk?filter_path=items.*.error
{"index":{"_id":"25510117"}}
{"field1":1,"field2":"2"}
...
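For context, a minimal sketch of how such a bulk body can be built and sent in Python. The index name and field values simply mirror the request above; the host and the helper function are assumptions for illustration:

```python
import json

def build_bulk_body(docs):
    """Build an NDJSON _bulk body: one action line plus one source line per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

body = build_bulk_body([("25510117", {"field1": 1, "field2": "2"})])

# Hypothetical ingestion call (host is an assumption):
# import requests
# requests.post(
#     "http://localhost:9200/cnfm_test_history_norm_state/_bulk",
#     params={"filter_path": "items.*.error"},
#     data=body,
#     headers={"Content-Type": "application/x-ndjson"},
# )
```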
A document ID is only unique within a single index (as long as you do not use routing). There is no way to enforce uniqueness across multiple indices, e.g. if you are using time-based indices and rollover. What you are seeing is therefore expected, as the documents are stored in two separate indices.
If you need the ID to be unique you have to either use a single index (which makes deletes and retention management more complex and expensive) or search for the document before updating it within your application (slow and tricky, especially with concurrent changes).
Yes, this is our case: we are using time-based indices with rollover.
The use case is a merge (join) of multiple streams based on _id.
We generate the same _id in both data streams and then join the data in Elasticsearch simply by addressing it via _id.
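For what it's worth, the join technique can be sketched roughly like this, assuming bulk `update` actions with `doc_as_upsert`, so that whichever stream arrives first creates the document and the other shallow-merges its fields in. The index name `joined_events` and the field names are hypothetical:

```python
import json

def merge_action(index, doc_id, partial_doc):
    """One bulk 'update' action that upserts and shallow-merges fields by _id."""
    action = json.dumps({"update": {"_index": index, "_id": doc_id}})
    payload = json.dumps({"doc": partial_doc, "doc_as_upsert": True})
    return action + "\n" + payload + "\n"

# Both streams write to the same _id. Note: the target must be a single
# concrete index -- with rollover the two halves can land in different
# backing indices, which is exactly the problem described in this thread.
body = (
    merge_action("joined_events", "25510117", {"field_from_stream_a": 1})
    + merge_action("joined_events", "25510117", {"field_from_stream_b": "2"})
)
```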
I did not expect this behavior, but I understand it. We will have to look for another way to solve our problem.
Just for clarification: the two data streams are quite big (up to 1 TB/day) and the delay (time shift) between the streams can be up to 2 hours.
Buffering that amount of data in memory is expensive, which is why we chose this technique, but it now looks like a bad approach.
@Christian_Dahlqvist thank you
You mentioned an approach based on delete/retention management.
I looked into it, and deletes don't seem to reduce the index size; they just mark documents as deleted.
To actually reclaim the space, _forcemerge has to be executed, but that should not be run against an index that is still being written to.
Do I understand this correctly, and is there any other method to handle this?
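For reference, the call I was considering looks roughly like this; `only_expunge_deletes` is a _forcemerge option that merges only segments containing deleted documents. The host and index name are placeholders:

```python
# Sketch only: reclaim space from deleted docs via _forcemerge.
# Host and index name are placeholders; run this against an index that is
# no longer being written to (e.g. one that has already rolled over).
host = "http://localhost:9200"
index = "my-index-000001"
url = f"{host}/{index}/_forcemerge?only_expunge_deletes=true"

# import requests
# requests.post(url)
```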
Correct. The deleted documents will be removed when the segments they are stored in are merged, which can take some time, but eventually they will be removed.
I would recommend letting Elasticsearch clean this up through normal merges, which will require more disk space in the meantime. I cannot think of any other way around this, so I would say this is the tradeoff you need to make.