Hi there, we're trying to decide if we need to trigger force merges on a regular basis. To do that, I was trying to understand how (and how often) merges are triggered by default. The piece of public documentation on merge does not tell me this; it just says "Smaller segments are periodically merged". Can anyone help me understand this?
It also seems like this information is not being displayed on purpose. I get that the defaults shouldn't be messed with casually, but I'm pretty frustrated that I can't easily know them.
Two great blog posts about merges in Lucene indices: "Visualizing Lucene's segment merges" (Changing Bits) and "Lucene's Handling of Deleted Documents" (Elastic Blog).
I've never had to tweak merge settings in the last 6 years, except in some rare cases index.merge.scheduler.max_thread_count (because the storage or CPU was slow or underpowered).
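For reference, the Elasticsearch merge documentation gives the default for index.merge.scheduler.max_thread_count as Math.max(1, Math.min(4, node.processors / 2)). A small sketch of that formula, assuming integer division as in Java (the function name is made up for illustration):

```python
def default_merge_thread_count(node_processors: int) -> int:
    """Documented default for index.merge.scheduler.max_thread_count.

    Mirrors Math.max(1, Math.min(4, node.processors / 2)) using integer
    division. The function name is hypothetical, not an Elasticsearch API.
    """
    return max(1, min(4, node_processors // 2))


if __name__ == "__main__":
    for cpus in (1, 2, 6, 16):
        print(cpus, "processors ->", default_merge_thread_count(cpus), "merge threads")
```

So the default only goes below 2 threads on very small nodes, which is roughly when lowering it by hand might be worth considering.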
The segments API and merge stats might help you understand the current state of your shards and segments.
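As a rough sketch of what you could do with the segments API output: the snippet below takes the JSON body returned by GET /<index>/_segments (already parsed into a dict) and counts segments and deleted documents per index. The function name and the sample data are made up for illustration, and it counts every shard copy it finds, replicas included.

```python
def summarize_segments(segments_response: dict) -> dict:
    """Summarize a parsed GET /<index>/_segments response per index.

    Counts segments, live docs and deleted docs across all shard copies
    (primaries and replicas alike). Hypothetical helper, not a client API.
    """
    summary = {}
    for index_name, index_data in segments_response.get("indices", {}).items():
        count = live = deleted = 0
        # "shards" maps a shard number to a list of shard copies.
        for shard_copies in index_data.get("shards", {}).values():
            for shard_copy in shard_copies:
                for segment in shard_copy.get("segments", {}).values():
                    count += 1
                    live += segment.get("num_docs", 0)
                    deleted += segment.get("deleted_docs", 0)
        total = live + deleted
        summary[index_name] = {
            "segments": count,
            "live_docs": live,
            "deleted_docs": deleted,
            "deleted_ratio": deleted / total if total else 0.0,
        }
    return summary


# Made-up sample in the shape of a _segments response (one shard, two segments).
sample = {
    "indices": {
        "logs": {
            "shards": {
                "0": [
                    {
                        "segments": {
                            "_0": {"num_docs": 900, "deleted_docs": 100},
                            "_1": {"num_docs": 100, "deleted_docs": 0},
                        }
                    }
                ]
            }
        }
    }
}

if __name__ == "__main__":
    print(summarize_segments(sample))
```

A high segment count or deleted ratio on an index that no longer receives writes is the kind of signal that might justify a force merge; on an index that is still being written, the background merges will keep changing these numbers.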
In the link to the other thread you've found, you've already seen 2 links to source code of Elasticsearch.
I think the general reason those settings are not documented is that they are considered internal implementation details that can change in any version (e.g. a new Lucene version, or changes to the merge policies Elasticsearch might introduce, should be of no concern to the end user).
A longer refresh interval should (generally) favor creating fewer, larger segments, since every refresh can write a new segment.
Force merges usually bring a lot of advantages, at the cost of temporary extra disk space and resource consumption while they run (and only as long as you never write to that index again).
I am aware I might not have fully answered your question, but keep in mind that EsTieredMergePolicy is just a thin layer on top of Lucene's TieredMergePolicy (elasticsearch/EsTieredMergePolicy.java at v8.1.0 · elastic/elasticsearch · GitHub).
Thanks for your answer! And those are good blog posts to read, I hadn't come across them before, I appreciate it!
I absolutely accept that merge settings don't need to be tweaked. I have no desire to touch them - especially since ForceMerge is available for people who need more frequent merging. But I don't know for sure if I am one of these people. Our ES6 cluster implements it, and we don't have any background on why the decision was made to do it. We're now upgrading to ES8 and trying to understand if we need to programmatically force merge. Not tweaking the settings, but knowing how merge is currently triggered would help me make that decision.
I did look into the Lucene source code and found the answers still quite confusing. It's possible that there isn't one answer, and that the trigger depends on a bunch of factors, but even that would've saved me some time if the merge docs just said so.
Segments API and merge stats in the index API sounds like the best direction for me to take, although I'd hesitate to rely on per-index information for this. I think what I'll end up doing is just a trial and error situation - keep an eye on query latency without implementing force merge, and see if that looks dangerous.
Thank you for taking the time to answer, I really appreciate it!
Glad it helps.
Remember to run a force merge only on indices you will no longer write to.
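If you end up triggering it programmatically, the call itself is small. Here is a minimal sketch that only builds the request for the force merge API (POST /<index>/_forcemerge with max_num_segments); the helper name and the index name are made up, and actually sending the request is left to whatever HTTP client you already use.

```python
from urllib.parse import quote, urlencode


def forcemerge_request(index: str, max_num_segments: int = 1) -> tuple:
    """Build (method, path) for a force merge call on one index.

    Hypothetical helper: it does not check whether the index is still
    being written to, so that check has to happen before calling it.
    """
    query = urlencode({"max_num_segments": max_num_segments})
    return "POST", f"/{quote(index, safe='')}/_forcemerge?{query}"


if __name__ == "__main__":
    method, path = forcemerge_request("logs-2022.01")
    print(method, path)
```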
If you wish, you can integrate the forcemerge automatically with an ILM policy.
The only drawback of integrating it with an ILM policy is that you cannot control the time at which the force merge will run. That is an outstanding enhancement request: Add the ability to specify when ILM step execution should occur · Issue #37325 · elastic/elasticsearch · GitHub
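To make the ILM option a bit more concrete, here is a sketch of a policy body with a forcemerge action in the warm phase. The min_age value and the helper name are assumptions for illustration; the body would be sent with PUT _ilm/policy/<your-policy-name>.

```python
import json


def build_forcemerge_policy(min_age: str = "7d", max_num_segments: int = 1) -> dict:
    """Return an ILM policy body that force merges indices in the warm phase.

    Hypothetical helper; the 7d default is only an example value.
    """
    return {
        "policy": {
            "phases": {
                "warm": {
                    "min_age": min_age,
                    "actions": {
                        "forcemerge": {"max_num_segments": max_num_segments}
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_forcemerge_policy(), indent=2))
```

Note that min_age only says how old an index must be before it enters the phase, not at what time of day the merge runs, which is the limitation described above.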
To be in full control of when it runs, I suggest using Curator.