Is there a max recommended size for shrunk shards?
We are on a 7.17.4 ES cluster with 10 data nodes and large volumes. We have one index that's kinda busy so we have an index template that defines 10/1 shards/replicas. Our ILM policy performs a rollover when a shard gets to be 50 GB (500 GB index size). After 5 days, the policy warm phase shrinks the shards to 5 (100 GB each) and removes the replica and sets it RO. We have no cold phase. It expires these particular indexes at 60 days. We have no deletes in our indexes, so I don't think that forced merges gain us anything, and in practice, enabling it caused several merge failures before we disabled it again.
In order to keep heap space usage under control (aiming to reduce shard count per data node), I'm considering modifying the warm phase to shrink this index down to 2 shards instead of 5, thus each shrunk shard would be 250 GB. Would shrinking them to 1 shard be advisable (each shrunk shard would be 500 GB) ?
Side note: sorting when listing shrunk shards is impossible with that extra 4 character hash that gets added. Is it possible to get rid of that hash? Or a fallback would be to set it to some static string?
I believe forcemerging down to a single segment can reduce heap usage. At least it used to in older versions. If that is still the case I would expect this to have a bigger impact on heap usage than reducing the number of shards. Given that you start out with 50GB shards I would probably test with forcemerging these down to a single segment and skip merging them. If you have a test environment I would recommend you restore a snapshot of an index and look at the reported memory usage. You can then see how this changes depending on whether you merge or forcemerge.
Thank you for that guidance. I will experiment with the merge down to 1 segment again on our dev cluster when transitioning to the warm phase.
Can I steer you back to the main part of my question? Is there anything fundamentally wrong with a shrink action resulting in 2 250GB RO shards? In 1 500 GB RO shard? Or is my premise wrong in that I won't actually reduce heap usage with this modification? (Because I thought the number of fields was a bigger factor in heap space usage than the size of the index, so reducing the number of shrunk indexes would reduce heap usage...is that wrong?)
Here is an old webinar that discusses the benefits of forcemerging down to a single segment in order to reduce heap pressure. As far as i know most of it is still valid. If it is not I hope someone from Elastic can help point to changes and improvements made since then.
I do not know if merging shards down to very large ones actually reduce heap usage by any significant amount. I am however aware of some drawbacks of doing this. The first is that it gets a LOT more difficult to forcemerge down to a single segment, which I believe has a significant impact on heap usage. You also end up with very large shards that get difficult to replicate on node failures and potential cluster instability. This is one of the reasons 50GB is often listed as a suitable max shard size. It is also worth noting that every query runs against a shard in a single thread (multiple shards can however be queried in parallel), so having very large shards can also have a negative impact on query performance.