_id is not unique: it is repeated across indices addressed by a single alias

I am running into a situation where the same _id appears multiple times under a single alias.
I am on version 7.17.0.
The data is ingested through the API using the Python requests library, in the following way.
As a result, uniqueness is broken across the multiple indices matching the same pattern.

POST /cnfm_test_history_norm_state/_bulk?filter_path=items.*.error
{"index":{"_id":"25510117"}}
{"field1":1,"field2":"2"}
...
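
For reference, the ingest call from Python looks roughly like this (a minimal sketch; the host URL is an assumption, not from our actual setup):

import requests

# Bulk payload is NDJSON: an action line followed by the document source
bulk_body = (
    '{"index":{"_id":"25510117"}}\n'
    '{"field1":1,"field2":"2"}\n'
)

resp = requests.post(
    "http://localhost:9200/cnfm_test_history_norm_state/_bulk",  # assumed host
    params={"filter_path": "items.*.error"},  # return only error items
    data=bulk_body.encode("utf-8"),
    headers={"Content-Type": "application/x-ndjson"},
)
resp.raise_for_status()

Counting how many documents carry that _id behind the alias: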
GET cnfm_test_history_norm_state/_count
{
  "query": {
    "terms": {
      "_id": [ "25510117"] 
    }
  }
}

returns:

{
  "count" : 2,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  }
}

or, in more detail:

GET cnfm_test_history_norm_state/_search
{
  "_source": ["_index"], 
  "query": {
    "terms": {
      "_id": [ "25510117"] 
    }
  }
}

see:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "cnfm_test_history_norm_state-000001",
        "_type" : "_doc",
        "_id" : "25510117",
        "_score" : 1.0,
        "_source" : { }
      },
      {
        "_index" : "cnfm_test_history_norm_state-000002",
        "_type" : "_doc",
        "_id" : "25510117",
        "_score" : 1.0,
        "_source" : { }
      }
    ]
  }
}

A document ID is only unique within a single index (as long as you do not use routing). There is no way to enforce uniqueness across multiple indices, e.g. if you are using time-based indices and rollover. What you are seeing is therefore expected, as the documents are stored in two separate indices.

If you need the ID to be unique you have to either use a single index (which makes deletes and retention management more complex and expensive) or search for the document before updating it within your application (which is slow and tricky, especially with concurrent changes).
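
A rough sketch of that search-before-write approach, using the same requests library (the endpoint and function name are illustrative, and concurrent writers would still need some form of locking or versioning):

import requests

ES = "http://localhost:9200"  # assumed endpoint
ALIAS = "cnfm_test_history_norm_state"

def index_if_absent(doc_id, doc):
    # First look the _id up across every backing index behind the alias
    r = requests.get(
        f"{ES}/{ALIAS}/_search",
        json={"query": {"terms": {"_id": [doc_id]}}, "size": 1},
    )
    r.raise_for_status()
    hits = r.json()["hits"]["hits"]
    if hits:
        # Already present somewhere: update it in the index it lives in,
        # rather than indexing a second copy through the write alias
        requests.post(
            f"{ES}/{hits[0]['_index']}/_update/{doc_id}",
            json={"doc": doc},
        ).raise_for_status()
    else:
        # Not found anywhere: index through the alias as usual
        requests.put(f"{ES}/{ALIAS}/_doc/{doc_id}", json=doc).raise_for_status()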


Yes, this is our case; we are using time-based indices with rollover.
The use case is a merge (join) of multiple streams based on _id: we create the same _id in both data streams and then join the data in Elasticsearch simply by addressing it as _id.

I did not expect this behavior, but I understand it. We will have to look for another way to solve our problem.
Just for clarification: the two data streams are quite big (up to 1 TB/day) and the delay (time shift) between the streams can be up to 2 hours, so buffering that amount of data in memory is expensive. That is why we chose this technique, but it looks like a bad approach.
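
For illustration, the merge the two producers rely on looks roughly like a bulk upsert keyed by the shared _id (a simplified sketch of the general technique, not our exact pipeline; the endpoint is assumed):

import json
import requests

ES = "http://localhost:9200"  # assumed endpoint
ALIAS = "cnfm_test_history_norm_state"

def upsert_partial(doc_id, partial):
    # Each stream sends only its own fields; doc_as_upsert merges them into
    # one document keyed by the shared _id. The join only works while both
    # writes land in the same backing index, which rollover does not guarantee.
    action = json.dumps({"update": {"_id": doc_id}})
    source = json.dumps({"doc": partial, "doc_as_upsert": True})
    r = requests.post(
        f"{ES}/{ALIAS}/_bulk",
        data=f"{action}\n{source}\n".encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
    )
    r.raise_for_status()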

@Christian_Dahlqvist thank you
You mentioned a method based on deletes / retention management.

I looked into it, and deletes don't seem to reduce the index size; they just mark documents as deleted.
To actually reclaim the space, _forcemerge has to be executed, but it is not recommended against an index that is still being written to.
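
For reference, the call I was looking at (the docs recommend running it only against indices that are no longer being written to, e.g. already rolled-over ones; the host is assumed):

import requests

# Physically drop deleted documents from an old backing index
resp = requests.post(
    "http://localhost:9200/cnfm_test_history_norm_state-000001/_forcemerge",  # assumed host
    params={"only_expunge_deletes": "true"},
)
resp.raise_for_status()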

Do I understand this correctly? Is there any other method to handle this?

Thank you

Correct. The deleted documents will be removed when the segments they are stored in are merged, which can take some time, but eventually they will be removed.

I would recommend letting Elasticsearch clean this up through normal merges, which will require some additional disk space in the meantime. I cannot think of any other way around this, so I would say this is the tradeoff you need to make.

