Too many Deleted Docs

Hi,

We have an update-heavy index with a lot of nested documents. We are using Elasticsearch 2.3, and I am concerned about the number of deleted documents in the index: we are seeing up to 50% deleted documents. I have tried multiple merge settings with the Tiered Merge Policy. Here are my index settings.

"index": {
  "codec": "best_compression",
  "refresh_interval": "300s",
  "number_of_shards": "20",
  "translog": {
    "flush_threshold_size": "2048mb",
    "durability": "async"
  },
  "merge": {
    "policy": {
      "max_merge_at_once": "20",
      "max_merged_segment": "15GB",
      "expunge_deletes_allowed": "5",
      "segments_per_tier": "20"
    }
  },
  "gc_deletes": "1s",
  "max_result_window": "200000",
  "requests": {
    "cache": {
      "enable": "true"
    }
  },
  "uuid": "erNEn9dVQGq6dcyavK11wA"
}

Using the hot threads API, I can see that there is some merge activity going on in the background in the cluster. Still, deleted documents make up a really high percentage of the index and are occupying disk space.

Please advise on what I should do to bring this number down without using optimize / force merge.

This is to be expected for a heavy-update use-case. I wouldn't worry about the percentage of deleted docs.

If you really, really, really need to get the number of deletions lower, you could set index.merge.policy.reclaim_deletes_weight to something like 3 (the default is 2, and 3 is already very high, so make sure not to go beyond 3). This tells Elasticsearch to favor merges that include a higher number of deleted documents. But I would not expect it to suddenly and magically reduce the number of deleted docs. If you have a heavy-update case, then most likely all your segments have a significant number of deletes.
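
For illustration, here is a minimal sketch of how that setting might be applied through the index update-settings endpoint. It assumes the setting can be updated dynamically on your version; the host http://localhost:9200, the index name my_index, and the helper name are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request


def build_reclaim_deletes_request(base_url, index, weight):
    """Build (but do not send) a PUT <index>/_settings request.

    base_url and index are placeholders for your own cluster and index.
    """
    # The advice in this thread is not to go beyond 3.
    if weight > 3.0:
        raise ValueError("reclaim_deletes_weight above 3 is not advised")
    body = {"index": {"merge": {"policy": {"reclaim_deletes_weight": weight}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_reclaim_deletes_request("http://localhost:9200", "my_index", 3.0)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually apply the change: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the body before touching a live cluster.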

Hi @jpountz Thanks a lot for your response.

Please let me know if my understanding is correct: according to the documentation, increasing reclaim_deletes_weight will favor merging segments with more deleted docs. Will existing segments that have already reached max_merged_segment also be affected by this setting?

The reason I am worried about the deleted documents is that they are taking up a lot of space. I have read these awesome blog posts by Michael: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html and https://www.elastic.co/blog/lucenes-handling-of-deleted-documents.

Here are some of the reasons why I am worried about the deleted docs.

We have a lot of nested-type fields in the Elasticsearch mapping, which creates many documents for each primary doc we have. We currently have about 20 billion docs (about 3.5 TB with deleted docs, about 2.2 TB without deleted docs) in the index with 20 primary shards, and the data is growing at a rapid pace. I am worried about reaching the limit of 2 billion docs per shard with so many deleted docs.

In terms of capacity planning, from @mikemccand's blog posts it looks like we have to account for 1.5X disk space. One thing I have read in the blog is that the Tiered Merge Policy does not reclaim deleted documents from segments that have already reached the max_merged_segment size. So I was wondering whether, for my use case, it would be better to use another merge policy such as LogByteSizeMergePolicy to reduce the overall number of deleted documents. But if TieredMergePolicy is the way to go, then we will account for the 1.5X disk space in our infrastructure planning.
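
As a rough sanity check on the numbers above, the arithmetic can be sketched as follows. The sizes and counts are the ones quoted in this post; the even spread of documents across shards is an assumption, the function names are hypothetical, and the per-shard ceiling used is Lucene's IndexWriter.MAX_DOCS (Integer.MAX_VALUE - 128).

```python
# Figures quoted in the post.
TOTAL_TB = 3.5                 # index size including deleted docs
LIVE_TB = 2.2                  # index size without deleted docs
TOTAL_DOCS = 20_000_000_000    # about 20 billion docs
PRIMARY_SHARDS = 20

# Lucene's hard per-index (per-shard) document limit.
LUCENE_MAX_DOCS = 2_147_483_647 - 128


def disk_overhead_factor(total_tb, live_tb):
    """How much disk is used relative to the live data alone."""
    return total_tb / live_tb


def deleted_share_of_disk(total_tb, live_tb):
    """Fraction of the on-disk size attributable to deleted docs."""
    return (total_tb - live_tb) / total_tb


def docs_per_shard(total_docs, shards):
    """Average docs per shard, assuming an even distribution."""
    return total_docs // shards


overhead = disk_overhead_factor(TOTAL_TB, LIVE_TB)      # about 1.59
deleted = deleted_share_of_disk(TOTAL_TB, LIVE_TB)      # about 0.37
per_shard = docs_per_shard(TOTAL_DOCS, PRIMARY_SHARDS)  # 1 billion
used = per_shard / LUCENE_MAX_DOCS                      # share of the limit used

print(f"disk overhead factor: {overhead:.2f}")
print(f"deleted share of disk: {deleted:.2f}")
print(f"docs per shard: {per_shard:,} ({used:.0%} of the Lucene limit)")
```

With these figures the overhead is already close to 1.6X, slightly above the 1.5X rule of thumb, and each shard sits at a bit under half of the per-shard document limit.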

Please advise.

Only segments whose size multiplied by the ratio of live documents is less than max_merged_segment/2 are eligible to be merged. So for instance if you have a segment of size max_merged_segment, it will only become eligible for merging once it reaches 50% deleted documents.
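
The rule above can be written as a small check. This is only a sketch of the condition as described in this reply, not the actual Lucene implementation; the function and parameter names are hypothetical.

```python
def is_merge_eligible(segment_bytes, live_docs, total_docs, max_merged_segment_bytes):
    """Return True if size * live-doc ratio is below max_merged_segment / 2."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    live_ratio = live_docs / total_docs
    return segment_bytes * live_ratio < max_merged_segment_bytes / 2


GB = 1024 ** 3
max_merged_segment = 15 * GB  # value from the index settings in this thread

# A full-size segment with 40% deleted docs: 15 GB * 0.6 = 9 GB, above 7.5 GB.
print(is_merge_eligible(15 * GB, live_docs=60, total_docs=100,
                        max_merged_segment_bytes=max_merged_segment))  # False

# The same segment with 60% deleted docs: 15 GB * 0.4 = 6 GB, below 7.5 GB.
print(is_merge_eligible(15 * GB, live_docs=40, total_docs=100,
                        max_merged_segment_bytes=max_merged_segment))  # True
```

Smaller segments cross the threshold with fewer deletes, which is why the largest segments tend to carry the most dead weight.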

No, switching to a different merge policy is unlikely to help. Quite the contrary, since e.g. LogByteSizeMergePolicy does not try to select merges that have more deletions.

Thanks @jpountz for all your inputs. We will account for the 1.5X disk space / documents, since this use case is an update-heavy one.
