If I index documents without specifying a fixed _id and duplicate documents appear, will they be created as duplicate entries in the index, or will ES recognize that the same entry already exists and not create a duplicate?
Specifically, in code, if I were to create an index like:
def create_index(self, index):
    #self.es.indices.delete(index=release_series_pair, ignore_unavailable=True)
    self.es.indices.create(index=index, timeout="10000s",
        mappings={
            "properties": {
                "name": {
                    "type": "text",
                },
                "path": {
                    "type": "text",
                },
                "content": {
                    "type": "text"
                },
                "text_embedding_3_small_emb": {
                    "type": "dense_vector"
                },
            },
        },
        settings={
            'index': {
                'number_of_replicas': 1
            }
        }
    )
    print(f"index created for {index}")
And then add a single document by calling the following (with documents containing a single doc):
def insert_documents(self, index, documents):
    operations = []
    for document in documents:
        operations.append({'index': {'_index': index}})
        doc = {
            "name": document["filename"],
            "path": document["path"],
            "content": document["content"],
            "text_embedding_3_small_emb": document["embedding"]
        }
        operations.append(doc)
    return self.es.bulk(operations=operations)
If I were to then run insert_documents() with the same document, would I have 2 documents in my index, or only one? What if I had specified an id like:
operations.append({
    "index": {
        "_index": index,
        "_id": doc_id
    }
})
Basically, I'm trying to find the easiest way to make indexing resumable: if my code fails partway through and I rerun the same code, it should just continue with the docs that haven't been indexed yet, instead of creating a duplicate of everything I've already indexed so far.
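To make the goal concrete, here is a minimal sketch of the explicit-_id variant. It assumes that each document's path is unique and stable across runs (my assumption, which may not hold for every dataset), and make_doc_id and build_operations are hypothetical helper names, not part of the Elasticsearch client. The idea is that a rerun produces exactly the same ids, and as far as I understand the bulk API, an index action whose _id already exists replaces that document rather than adding a second one.

```python
import hashlib


def make_doc_id(path: str) -> str:
    """Derive a stable _id from the document path (hypothetical helper).

    Assumes the path uniquely identifies a document across runs.
    """
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def build_operations(index: str, documents: list[dict]) -> list[dict]:
    """Build the flat action/source list in the shape passed to es.bulk(operations=...)."""
    operations = []
    for document in documents:
        # Action line: same path -> same _id on every run.
        operations.append({
            "index": {
                "_index": index,
                "_id": make_doc_id(document["path"]),
            }
        })
        # Source line: the document body itself.
        operations.append({
            "name": document["filename"],
            "path": document["path"],
            "content": document["content"],
            "text_embedding_3_small_emb": document["embedding"],
        })
    return operations
```

The returned list would be passed to self.es.bulk(operations=...) the same way as in insert_documents. Note that this sketch re-sends and overwrites documents that were already indexed rather than skipping them; swapping the "index" action for "create" should instead make the bulk request report a conflict for ids that already exist.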