Duplicate indexing behavior without _id

csanadpoda · April 24, 2025, 7:09pm

If I were to index documents without specifying a fixed _id, when duplicate documents appear, would they be created as duplicate entries in the index, or would ES recognize that there's already the same entry in the index so it wouldn't create a duplicate?

Specifically, in code, if I were to create an index like:

    def create_index(self, index):
        # self.es.indices.delete(index=index, ignore_unavailable=True)
        self.es.indices.create(index=index, timeout="10000s",
            mappings={
                "properties": {
                    "name": {
                        "type": "text",
                    },
                    "path": {
                        "type": "text",
                    },
                    "content": {
                        "type": "text"
                    },
                    "text_embedding_3_small_emb": {
                        "type": "dense_vector"
                    },
                },
            },
            settings={
                'index': {
                    'number_of_replicas': 1
                }
            }
        )
        print(f"index created for {index}")

And then add a single document by calling this (with documents containing a single doc):

    def insert_documents(self, index, documents):
        operations = []
        for document in documents:
            operations.append({'index': {'_index': index}})
            doc = {
                "name": document["filename"],
                "path": document["path"],
                "content": document["content"],
                "text_embedding_3_small_emb": document["embedding"]
            }
            operations.append(doc)

        return self.es.bulk(operations=operations)

If I were to then run insert_documents() with the same document, would I have 2 documents in my index, or only one? What if I had specified an id like:

        operations.append({
          "index": {
            "_index": index,
            "_id":    doc_id
          }
        })

Basically, I'm trying to find the easiest way to make indexing resumable: if my code fails partway through and I rerun it, it should just continue with the docs that haven't been indexed yet, instead of creating a duplicate of everything I've already indexed.

JD_Armada · April 24, 2025, 9:07pm

Hey @csanadpoda, welcome to the community!

If you don't specify a fixed _id, Elasticsearch will generate one for you. It does not compare document contents, so re-running your function with the same documents and no fixed _id gives every document a fresh auto-generated ID and creates duplicate entries in your index. The best thing to do is to give your documents a fixed _id, so that if your function fails, re-running it overwrites the documents that already exist (same _id, incremented version) and does not create duplicates.
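As a rough sketch of one way to get a fixed _id: hash something that uniquely identifies each document. Here I'm assuming the path is unique per document, which may not be true for your data. The helper names make_doc_id and build_operations are made up for this example; the field names are taken from your insert_documents().

```python
import hashlib


def make_doc_id(document):
    # Assumption: "path" uniquely identifies a document.
    # Hash a different field (or several) if that is not true for your data.
    return hashlib.sha256(document["path"].encode("utf-8")).hexdigest()


def build_operations(index, documents):
    # Build the action/source pairs that the bulk API expects.
    operations = []
    for document in documents:
        # Same input document -> same _id on every run, so a rerun
        # overwrites the existing entry instead of adding a new one.
        operations.append({"index": {"_index": index, "_id": make_doc_id(document)}})
        operations.append({
            "name": document["filename"],
            "path": document["path"],
            "content": document["content"],
            "text_embedding_3_small_emb": document["embedding"],
        })
    return operations
```

You would then pass the result to the client the same way you do now, e.g. self.es.bulk(operations=build_operations(index, documents)).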

Your example above of setting the _id should work as long as each document gets its own unique value and that value stays the same every time you rerun the code.
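If you would rather skip documents that are already in the index than overwrite them, the bulk API also has a create action: with a fixed _id, a create for an ID that already exists fails for that item with a 409 version conflict, while the rest of the batch still goes through. Below is a minimal sketch of sorting a bulk response into new, already-present, and failed items. The function name split_bulk_results is hypothetical, and I'm assuming the response can be read like a dict with an "items" list, which is the shape of the bulk API's JSON response.

```python
def split_bulk_results(response):
    """Sort bulk results into created, already-present (409), and failed IDs."""
    created, skipped, failed = [], [], []
    for item in response["items"]:
        # Each item is keyed by its action name, e.g. {"create": {...}}.
        action, result = next(iter(item.items()))
        status = result["status"]
        if status in (200, 201):
            created.append(result["_id"])
        elif status == 409:
            # The _id already exists, so this doc was indexed by an earlier run.
            skipped.append(result["_id"])
        else:
            failed.append((result["_id"], result.get("error")))
    return created, skipped, failed
```

To use it, the action line in your loop would be {"create": {"_index": index, "_id": doc_id}} instead of {"index": ...}, and you would call split_bulk_results(self.es.bulk(operations=operations)). Anything left in failed is what still needs a retry.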
