Our use case:
- pretty big cluster - billions of docs
- we update documents in place
- data is not time-sliced, since we often retrieve and modify old documents
- Issue: over time we have accumulated a lot of deleted documents in the indices; it's now close to 20%
- we are on 1.6.x
It turns out that we have a few segments close to 5GB, and with the default settings Elasticsearch doesn't want to merge them (a segment already at max_merged_segment is mostly excluded from regular merging, so its deleted docs just sit there).
We'd like to be able to defragment the cluster to avoid wasting space, especially since the number of deleted docs grows over time.
I see two solutions here:
- Change the merge policy to something like this:
index.merge.policy.max_merged_segment: 20gb # 5gb is the default
index.merge.policy.reclaim_deletes_weight: 3.0 # 2 is the default
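Applying these live would look something like this (a sketch, assuming an index named my_index; as far as I know merge policy settings are dynamically updatable on 1.x):

# update the merge policy on a live index (hypothetical index name)
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.merge.policy.max_merged_segment": "20gb",
  "index.merge.policy.reclaim_deletes_weight": 3.0
}'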
This should help us right now, but it only pushes the issue out in time: once we accumulate 20% deleted docs in those 20gb segments we'll be back where we started.
- Manually optimize the indices using the optimize API
We can make it a weekly or monthly job, but I'm afraid the segments will grow without bound this way and eventually kill cluster performance.
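For concreteness, this is the call I mean (hypothetical index name; max_num_segments=1 is what makes the resulting segment size unbounded):

# force-merge each shard down to a single segment (hypothetical index name)
curl -XPOST 'localhost:9200/my_index/_optimize?max_num_segments=1'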
Q1. I guess what we really want is some kind of _optimize which would turn, e.g.,
- 5 * 5gb segments with 20% deleted docs
- into 4 * 5gb segments with 0% deleted docs
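The closest thing I've found is the only_expunge_deletes flag on _optimize, which only merges segments whose deleted-docs ratio exceeds index.merge.policy.expunge_deletes_allowed (10% by default) instead of collapsing everything into one segment, though I'm not sure it gives exactly the behaviour above:

# rewrite only segments with >10% deleted docs (hypothetical index name)
curl -XPOST 'localhost:9200/my_index/_optimize?only_expunge_deletes=true'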
Q2. Is there any other way this use case should be handled without reindexing?
Q3. Do big shards have any negative impact on the cluster?