How to defrag an index?


Our use case:

  • pretty big cluster - billions of docs
  • we update documents in place
  • data are not time-sliced as we often do retrieve and modify old documents
  • Issue: over time we accumulated a lot of deleted documents in the indices; it is close to 20%
  • we are on 1.6.x

It turns out that we have a few segments close to 5GB, and with the default settings Elasticsearch doesn't want to merge them.

We'd like to be able to defragment the cluster to avoid wasting space, especially since the number of deleted docs grows over time.

I see two solutions here:

  1. Change the merge policy to something like this

    index.merge.policy.max_merged_segment: 20gb # 5gb is the default
    index.merge.policy.reclaim_deletes_weight: 3.0 # 2 is the default

This should help us right now, but it only pushes the issue out in time: once we accumulate 20% of deleted docs in these 20gb segments, we'll be back where we are now.

  2. Manually optimize the indices using the optimize API: _optimize?max_num_segments=1

We can make it a weekly or monthly job, but I'm afraid that the segments will grow unbounded this way and eventually we will kill the cluster performance.
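For reference, both options can be tried over the REST API. This is just a sketch against a 1.x cluster; the host and the index name `myindex` are placeholders:

```shell
# Option 1: raise the merge ceiling and bias merges toward reclaiming
# deletes (merge policy settings are dynamically updatable in 1.x)
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index.merge.policy.max_merged_segment": "20gb",
  "index.merge.policy.reclaim_deletes_weight": 3.0
}'

# Option 2: force-merge everything down to one segment per shard
curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=1'
```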

Q1. I guess what we really want is some kind of an _optimize which will turn e.g.

  • 5 * 5gb segments with 20% deleted
  • into 4 * 5gb segments with 0% deleted
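The back-of-envelope arithmetic behind that expectation, using the sizes and ratio quoted above:

```shell
# 5 segments of 5 GB each, ~20% of the docs in them deleted
segments=5; segment_gb=5; deleted_pct=20

total_gb=$((segments * segment_gb))               # 25 GB on disk
live_gb=$((total_gb * (100 - deleted_pct) / 100)) # 20 GB of live docs
ideal_segments=$((live_gb / segment_gb))          # fits in 4 full segments

echo "$total_gb $live_gb $ideal_segments"         # prints: 25 20 4
```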

Q2. Is there any other way this use case could be handled without reindexing?

Q3. Do big shards have any negative impact on the cluster?

Q1: you may want to send an _optimize with only_expunge_deletes=true
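In curl form (again a sketch; the index name is a placeholder). Unlike max_num_segments=1, this only merges segments whose deleted-doc ratio exceeds the merge policy's expunge threshold (index.merge.policy.expunge_deletes_allowed, 10% by default), rather than collapsing everything into one jumbo segment:

```shell
# Reclaim space from deleted docs without forcing a single segment
curl -XPOST 'localhost:9200/myindex/_optimize?only_expunge_deletes=true'
```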

Q2: leave deleted documents in the index and filter them out by some criterion at search time, or rearrange your index organization so that old/unneeded indices can be dropped

Q3: yes

Thanks Jörg for the answers. It looks like there is no way to merge 5 shards into 4 shards of similar size, right? You can only merge 5 segments into one big segment?

Each shard is an individual Lucene index. Each index is made up of small
segments, which are immutable and are merged from time to time. The number
of shards cannot be changed once an index has been created.

I cannot see your original question since this "mailing list" does not
always deliver emails.


Instead of shards I meant segments. Edited the post.

Segment merging always goes down to a single segment. I've run into this issue before, btw. Ultimately we lived with the delete overhead.

Calling _optimize can actually make the trouble worse because it makes even bigger segments which the merge scheduler wants to merge even less than it wants to merge the ones around 5gb.

It's something I've talked about with @mikemccand a few times, but we never came up with a good solution.

Thanks guys, so probably we'll also need to live with some overhead. We'll at least try to understand the merge policy in more detail and maybe tweak it a bit to match our use case.