Elasticsearch duplicates increase disk size

Hello.
I have indexed 7 million+ documents into Elasticsearch, handling collisions by updating the _version value.
I expected "_version" to be incremented every time a document with the same "_id" was loaded, and the storage size to stay unchanged.
Is it expected behaviour for Elasticsearch to store the colliding document regardless?
Or is the document scheduled to be deleted by some future clean-up operation?

Logstash ingestion config:
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "field_increment"
    document_id => "%{fingerprint}"
  }
}

#indexed collisions.json content
{"document": "value"}
{"document": "value"}
...

{
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "stats": {
    "uuid": "MCTuaE2sShKybrl6VHAGIw",
    "primaries": {
      "docs": {
        "count": 1,
        "deleted": 517707
      },
      "store": {
        "size_in_bytes": 354878026
      },
      "indexing": {
        "index_total": 6600000,
        "index_time_in_millis": 6711318,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0
      },
      ...

Storage does change when you update documents. An update in Elasticsearch indexes a new copy of the document rather than overwriting the previous one; the old copy is only marked as deleted. These stale documents are cleaned up only when a segment merge happens, so I think the storage size will fluctuate.
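To make the behaviour described above concrete, here is a minimal toy model in Python. It is only a sketch: `ToyIndex` and its methods are hypothetical names invented for illustration and do not reflect Elasticsearch's or Lucene's real internals. It assumes a single append-only store in which re-indexing an existing `_id` marks the old copy as deleted and appends a new one, and in which a merge drops the copies marked as deleted.

```python
class ToyIndex:
    """Toy append-only index (hypothetical, not Elasticsearch's real code)."""

    def __init__(self):
        self.entries = []    # every stored copy, live or marked as deleted
        self.live_pos = {}   # _id -> position of the live copy in entries

    def index(self, doc_id, source):
        """Index a document; an existing _id gets a new copy and version."""
        version = 1
        if doc_id in self.live_pos:
            old = self.entries[self.live_pos[doc_id]]
            old["live"] = False              # old copy is only marked deleted
            version = old["version"] + 1
        self.entries.append(
            {"id": doc_id, "version": version, "source": source, "live": True}
        )
        self.live_pos[doc_id] = len(self.entries) - 1
        return version

    def merge(self):
        """Rewrite the store, keeping only live copies (frees the space)."""
        self.entries = [e for e in self.entries if e["live"]]
        self.live_pos = {e["id"]: i for i, e in enumerate(self.entries)}

    def stats(self):
        live = sum(1 for e in self.entries if e["live"])
        return {"count": live, "deleted": len(self.entries) - live}


idx = ToyIndex()
for _ in range(5):
    last_version = idx.index("abc", {"document": "value"})

before = idx.stats()   # one live copy, four stale copies still stored
idx.merge()
after = idx.stats()    # stale copies are gone after the merge
```

If every indexing operation in the stats above targeted the same "_id", the numbers would fit this picture: 6,600,000 index operations, 1 live document, and 517,707 deleted copies still waiting for a merge, which would mean roughly 6,082,292 stale copies had already been removed by earlier merges.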

That is what I thought.
Yesterday I tested this by loading the same document repeatedly, so that the version collisions would increment the version.

Have a look:


Segment merges happened and the Elasticsearch index memory and request cache dropped, but the disk size kept growing and then stayed at that level:

Is there a way to reclaim that stored space somehow?
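One option that may help here is the force merge API with `only_expunge_deletes`, which asks Elasticsearch to merge away segments that contain deleted documents. The sketch below only builds the request, using the host and index name from the Logstash config above. The helper names `build_forcemerge_request` and `send` are hypothetical, and this assumes an unsecured local node. Force merging can be expensive, so it is usually best run on an index that is no longer being written to.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_forcemerge_request(host, index, only_expunge_deletes=True):
    """Build a POST request for <host>/<index>/_forcemerge (not sent yet)."""
    query = urlencode(
        {"only_expunge_deletes": str(only_expunge_deletes).lower()}
    )
    return Request(f"{host}/{index}/_forcemerge?{query}", method="POST")


def send(request):
    """Send the request; the call blocks until the merge has finished."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Host and index are taken from the Logstash output section in this thread.
req = build_forcemerge_request("http://localhost:9200", "field_increment")

# Uncomment to actually run the merge against a live cluster:
# print(send(req))
```

After the merge finishes, the "deleted" count and "size_in_bytes" in the index stats should show whether the space was actually released.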
