Compressing and forcemerging past time-based indices

Thanks @DavidTurner. Do you mean one force-merge thread for the entire cluster, or one per shard?

That's a very good question; sorry, I should have been more precise. It's one thread per node. So it's worth starting as many force-merges as you can: they'll run in sequence on each node, but in parallel across the whole cluster.
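For example, kicking off several force-merge requests at once from a shell (the index names and endpoint here are hypothetical placeholders) lets each node work through its own merge queue one at a time while the cluster as a whole merges in parallel:

```shell
#!/bin/sh
# Hypothetical monthly index names -- adjust to your own naming scheme.
# Each _forcemerge request blocks until that merge finishes, so background
# them all; Elasticsearch still runs only one force-merge thread per node,
# but different nodes can merge different shards concurrently.
for idx in logs-2023-01 logs-2023-02 logs-2023-03; do
  curl -s -X POST "localhost:9200/${idx}/_forcemerge?max_num_segments=1" &
done
wait  # return once every queued force-merge request has completed
```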


Excellent, thanks for the clarification, David. Yes, I usually run force-merges using the Curator tool, invoked via a cron job on each data node.
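For reference, a Curator action file along those lines might look something like this (the index prefix and the `delay` value are assumptions for illustration, not taken from this thread):

```yaml
# Hypothetical actions.yml for Curator; adjust the pattern to your indices.
actions:
  1:
    action: forcemerge
    description: "Force-merge read-only monthly indices down to one segment"
    options:
      max_num_segments: 1
      delay: 120   # pause (seconds) between indices to smooth out I/O load
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-monthly-
```

A cron entry would then invoke it as `curator --config config.yml actions.yml`.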

Hi @Christian_Dahlqvist and @warkolm

Just wanted to update you on some amazing numbers I ended up with after applying best_compression to past read-only monthly indices and then force-merging them.

Of the past 18 monthly indices that I compressed and force-merged, the minimum size reduction I achieved was a staggering 42.5% (an index of 515 GB came down to 297 GB), and the maximum was 45% (from 645 GB down to 354 GB). This has almost halved my storage requirements :tada: Heartfelt thanks!
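For anyone finding this thread later, the workflow described above can be sketched roughly as follows (the index name is a placeholder; note that `index.codec` is a static setting, so the index has to be closed before it can be changed):

```shell
#!/bin/sh
# Hypothetical read-only monthly index -- substitute your own name.
IDX="logs-2023-06"

# 1. Block writes: best_compression only pays off on indices that no
#    longer receive new documents.
curl -s -X PUT "localhost:9200/${IDX}/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.write": true}'

# 2. index.codec is static, so close the index, switch the codec, reopen.
curl -s -X POST "localhost:9200/${IDX}/_close"
curl -s -X PUT "localhost:9200/${IDX}/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.codec": "best_compression"}'
curl -s -X POST "localhost:9200/${IDX}/_open"

# 3. Force-merge rewrites every segment, which is what actually applies
#    the new codec to the existing data.
curl -s -X POST "localhost:9200/${IDX}/_forcemerge?max_num_segments=1"
```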

However, I have one question. When the monthly index was initially 645 GB (it was created by re-indexing day-wise indices into a monthly index to improve query performance), I had opted for 12 shards so that each shard would be around 40-55 GB. Now, with best_compression, the size is down by about 45% to 354 GB. My question is: should I reduce the number of shards from 12 to 6? Given that these indices are already force-merged, will shrinking from 12 to 6 help, or will I have to force-merge again after shrinking?

Hey @Christian_Dahlqvist / @warkolm / @DavidTurner - I'd really appreciate some insights on this.

Apologies for multiple tags.

Tricky to say. 20-30 GB isn't unreasonably small for a shard, so IMO you could just leave them at 12; on the other hand, there is usually some data duplicated across shards (e.g. the terms dictionary), so shrinking might save you some more space. You don't have to force-merge anything, and again it's tricky to say whether that would improve things further. I don't think you'll find another 40% of space savings, but it depends on the details of your data.
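A shrink along the lines discussed here could be sketched like this (the node and index names are placeholders; the shrink API requires a write block and a complete copy of every shard on a single node before it will run):

```shell
#!/bin/sh
# Hypothetical source/target names and node name -- substitute your own.
SRC="logs-2023-06"
DST="logs-2023-06-shrunk"

# 1. Block writes and pull a copy of every shard onto one node.
curl -s -X PUT "localhost:9200/${SRC}/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
        "index.blocks.write": true,
        "index.routing.allocation.require._name": "node-1"
      }'

# 2. Shrink: the target shard count must be a factor of the source count
#    (12 -> 6 here). Clearing the allocation requirement on the target
#    lets its shards be rebalanced normally afterwards.
curl -s -X POST "localhost:9200/${SRC}/_shrink/${DST}" \
  -H 'Content-Type: application/json' \
  -d '{
        "settings": {
          "index.number_of_shards": 6,
          "index.routing.allocation.require._name": null
        }
      }'
```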

Sorry there's no definitive answer here, I don't think we can offer more guidance than to try it and see. You've certainly got space to experiment now :slight_smile:


Thanks a ton, David. Totally understood, and that helps a lot. I was just looking for guidance on whether it's worth trying at all, and from your input it seems worthwhile to try shrinking and see. I agree that 20-30 GB isn't too small for a shard and that shrinking may not have much impact on size. The only reason I considered shrinking is this: when searching the past 18 months of data (the current month's index plus the past 18 monthly indices), a query currently hits 19 indices × 12 shards = 228 shards. If I reduce to 6 shards per index, the query will hit just 19 × 6 = 114 shards. That was my reasoning.

Of course, I'm not expecting 40% space savings again. More than happy with the gains I've already obtained :sweat_smile:

So I suppose I can run Curator to shrink each index from 12 shards to 6. And since the data is already force-merged, I reckon each index will end up with 6 primary segments instead of the current 12.

Thanks for all your inputs.

That makes sense; searching fewer, larger shards can be more efficient (because of how the data structures scale), but it can also be less efficient (e.g. it uses fewer parallel threads). It all depends...

Shrinking doesn't necessarily adjust the segment count. If you currently have twelve 20 GB shards, each with a single segment, then shrinking by a factor of two will almost certainly leave you with six 40 GB shards, each with two segments.
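In other words, if single-segment shards were still the goal after shrinking, the shrunk index (hypothetical name below) would need one more force-merge:

```shell
# A shrunk index keeps its source segments, so a further force-merge is
# what would bring each of the six shards back down to one segment.
curl -s -X POST "localhost:9200/logs-2023-06-shrunk/_forcemerge?max_num_segments=1"
```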

And that means that, after shrinking, the index would need to be force-merged again if I want to reduce the segments to 1?

Got it. In that case, I think I'm better off with what I've now :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.