Duplicate indexing behavior without _id

If I were to index documents without specifying a fixed _id and the same document appeared more than once, would duplicate entries be created in the index, or would ES recognize that an identical entry already exists and not create a duplicate?

Specifically, in code, if I were to create an index like:

    def create_index(self, index):
        # Uncomment to drop any existing index first:
        # self.es.indices.delete(index=index, ignore_unavailable=True)
        self.es.indices.create(index=index, timeout="10000s",
            mappings={
                "properties": {
                    "name": {
                        "type": "text",
                    },
                    "path": {
                        "type": "text",
                    },
                    "content": {
                        "type": "text"
                    },
                    "text_embedding_3_small_emb": {
                        "type": "dense_vector"
                    },
                },
            },
            settings={
                'index': {
                    'number_of_replicas': 1
                }
            }
        )
        print(f"index created for {index}")

And then add a single document by calling the following (with documents containing a single doc):

    def insert_documents(self, index, documents):
        operations = []
        for document in documents:
            # Each bulk operation is an action line followed by the document source.
            operations.append({'index': {'_index': index}})
            doc = {
                "name": document["filename"],
                "path": document["path"],
                "content": document["content"],
                "text_embedding_3_small_emb": document["embedding"]
            }
            operations.append(doc)

        return self.es.bulk(operations=operations)

If I were to then run insert_documents() with the same document, would I have 2 documents in my index, or only one? What if I had specified an _id, like:

        operations.append({
          "index": {
            "_index": index,
            "_id":    doc_id
          }
        })

Basically, I'm trying to understand the easiest way to make indexing resumable: if my code fails partway through and I rerun it, it should just continue with the docs that were not indexed yet, instead of creating a duplicate of everything I've already indexed.

Hey @csanadpoda, welcome to the community!

If you don't specify a fixed _id, Elasticsearch generates one for each document, so re-running your function with the same documents and no fixed _id will create duplicate entries in your index. The best thing to do is to give your documents a fixed _id: if your function then fails partway through, re-running it will only overwrite the already-indexed documents rather than create duplicates.
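
For example, here is a minimal sketch of your insert_documents() with a deterministic _id, assuming that document["path"] uniquely identifies each document (any stable, unique field, or a hash of the whole document, works just as well):

    import hashlib

    def insert_documents(self, index, documents):
        operations = []
        for document in documents:
            # Derive a stable _id from the path (assumes the path uniquely
            # identifies a document). Re-running this with the same documents
            # overwrites the existing entries instead of adding new ones.
            doc_id = hashlib.sha256(document["path"].encode("utf-8")).hexdigest()
            operations.append({"index": {"_index": index, "_id": doc_id}})
            operations.append({
                "name": document["filename"],
                "path": document["path"],
                "content": document["content"],
                "text_embedding_3_small_emb": document["embedding"],
            })
        return self.es.bulk(operations=operations)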

Your example above of appending the _id should work, as long as the values are unique per document.
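
If you'd rather skip already-indexed documents entirely instead of overwriting them, you can use the create action in place of index: for an existing _id, create fails with a 409 conflict for that item only, and the rest of the bulk request still goes through. A sketch of the relevant lines, reusing the doc_id idea from above:

    # "create" rejects an existing _id instead of overwriting it.
    operations.append({"create": {"_index": index, "_id": doc_id}})
    operations.append(doc)

    response = self.es.bulk(operations=operations)
    for item in response["items"]:
        # Status 409 means the document already existed; on a re-run
        # that is expected and safe to ignore.
        if item["create"]["status"] == 409:
            pass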