I thought I should revisit this thread in case anyone else is repeating my
mistakes, which it turns out are multiple. On the bright side, I do seem
to have resolved my issues.
tl;dr: optimize was screwing me up, and the merge settings I thought I had
in place were not actually there/active. Once applied all is well.
First, the regular use of optimize?only_expunge_deletes. I did not realize
at first that this command would in fact ignore the max_merged_segment
parameter (I thought I had checked it at one point, but I must not have).
While max_merged_segment was set to 2 GB, I ended up with segments as large
as 17 GB. I reindexed everything one weekend to observe merge behaviour
better and clear these out, and it wasn't until those segments were almost
completely full of deleted docs that they were merged out (they finally
vanished overnight, so I'm not exactly sure what the tipping point was, but
I do know they were at around 4/5 deleted at one point). Clearly my use of
optimize was putting the system in a state that only additional optimize
calls could clean, making the cluster "addicted" to the optimize call.
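For anyone wanting to check whether they are in the same state, here is a rough Python sketch of how the per-segment numbers could be pulled out of a segments API response. The response shape used here (indices, then shards, then a "segments" map with num_docs, deleted_docs and size_in_bytes) is an assumption based on what GET /_segments returns, and the function names are made up for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, str, str, float, float]

def segment_delete_ratios(segments_response: dict) -> List[Row]:
    """Return (index, shard, segment, size_gb, deleted_ratio), largest first.

    Assumes the dict has the shape of a GET /_segments response.
    """
    rows = []
    for index_name, index_data in segments_response.get("indices", {}).items():
        for shard_id, copies in index_data.get("shards", {}).items():
            for copy in copies:  # one entry per shard copy (primary/replica)
                for seg_name, seg in copy.get("segments", {}).items():
                    live = seg.get("num_docs", 0)
                    deleted = seg.get("deleted_docs", 0)
                    total = live + deleted
                    ratio = deleted / total if total else 0.0
                    size_gb = seg.get("size_in_bytes", 0) / 1024 ** 3
                    rows.append((index_name, shard_id, seg_name, size_gb, ratio))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows

def oversized(rows: List[Row], max_merged_segment_gb: float) -> List[Row]:
    """Segments larger than the configured max_merged_segment (in GB)."""
    return [row for row in rows if row[3] > max_merged_segment_gb]
```

Segments that show up in oversized() with a 2 GB limit would be the ones an explicit optimize produced, and their deleted ratio shows how far they are from being merged away on their own.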
Second, and this is the more embarrassing thing, my changed merge settings
had mostly not taken effect (or were reverted at some point). After
removing all of the large segments via a full reindex, I added nodes to get
the system to a stable point where normal merging would keep the deleted
docs in check. It ended up taking 5 or 6 nodes to maintain ~30% delete
equilibrium and enough memory to operate, which was 2-3 more nodes than I
really wanted to dedicate. I decided then to bump the max_merged_segment
up as per Nikolas's recommendation above (just returning it to the default
5 GB to start with), but noticed that the index merge settings were not
what I thought they were. Sometime, probably months ago when I was trying
to tune things originally, I apparently made a mistake, though I'm still
not exactly sure when/where. I had the settings defined in the
elasticsearch.yml file, but I'm guessing those are only applied to new
indices when they're created, not existing indices that already have their
configuration set? I know I had updated some settings via the API at some
point, but perhaps I had reverted them, or simply not applied them to the
index in question. Regardless, the offending index still had mostly
default settings, only the max_merged_segment being different at 2 GB.
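A check like the following would have caught this months earlier. It is only a sketch: it compares the settings I believe are in place against whatever the index actually reports, and builds the body for an update-settings call. The helper names are hypothetical, and the nested shape of the live settings is an assumption about what GET /<index>/_settings returns.

```python
import json

def flatten(settings: dict, prefix: str = "") -> dict:
    """Turn nested settings into dotted keys, e.g. index.merge.policy.x."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = str(value)
    return flat

def settings_drift(desired: dict, live_settings: dict) -> dict:
    """Map each mismatched key to (live_value_or_None, desired_value)."""
    live = flatten(live_settings)
    return {
        key: (live.get(key), str(wanted))
        for key, wanted in desired.items()
        if live.get(key) != str(wanted)
    }

def update_body(drift: dict) -> str:
    """JSON body that would bring the drifted keys to their desired values."""
    return json.dumps({key: wanted for key, (_, wanted) in drift.items()},
                      sort_keys=True)

desired = {
    "index.merge.policy.max_merged_segment": "5gb",
    "index.merge.policy.reclaim_deletes_weight": "6.0",
}
# What the index reported in my case: only max_merged_segment was set.
live = {"index": {"merge": {"policy": {"max_merged_segment": "2gb"}}}}
drift = settings_drift(desired, live)
```

The string from update_body(drift) could then be sent with PUT /<index>/_settings. Whether the change takes effect without moving or restarting shards depends on the version, as discussed further down the thread.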
I applied the settings above (plus the 5 GB max_merged_segment value) to
the cluster and then performed a rolling restart to let the settings take
effect. As each node came up, the deleted docs were quickly merged out of
existence and the node stabilized at ~3% deleted. CPU spiked to 100% while
this took place, disk didn't seem to be too stressed (it reported 25%
utilization when I checked via iostat at one point), but once the initial
clean-up was done things settled down, and I'm expecting smaller spikes as
it maintains the lower deleted percentage (I may even back down the
reclaim_deletes_weight). I need to see how it actually behaves during
normal load during the week before deciding everything is completely
resolved, but so far things look good, and I've been able to back down to
only 3 nodes again.
So, I've probably wasted dozens of hours and hundreds of dollars of server
time resolving what was ultimately a self-inflicted problem that should
have been fixed easily months ago. So it goes.
On Thursday, December 4, 2014 11:54:07 AM UTC-5, Jonathan Foy wrote:
I do agree with both of you that my use of optimize as regular maintenance
isn't the correct way to do things, but it's been the only thing that I've
found that keeps the deleted doc count/memory under control. I very much
want to find something that works to avoid it.
I came to much the same conclusions that you did regarding the merge
settings and logic. It took a while (and eventually just reading the code)
to find out that though dynamic, the merge settings don't actually take
effect until a shard is moved/created (fixed in 1.4), so a lot of my early
work thinking I'd changed settings wasn't really valid. That said, my
merge settings are still largely what I have listed earlier in the thread,
though repeating them for convenience:
index.merge.policy.reclaim_deletes_weight: 6.0 <-- This one I know is
quite high, I kept bumping it up before I realized the changes weren't
taking effect immediately
I DO have a mess of nested documents in the type that I know is the most
troublesome...perhaps the merge logic doesn't take deleted nested documents
into account when deciding what segment to merge? Or perhaps since I have
a small max_merged_segment, it's like Nikolas said and those max sized
segments are just rarely reclaimed in normal operation, and so the deleted
doc count (and the memory they take up) grows. I don't have memory issues
during normal merge operations, so I think I may start testing with a
larger max segment size.
I'll let you know if I ever get it resolved.
On Wednesday, December 3, 2014 3:05:18 PM UTC-5, Govind Chandrasekhar wrote:
Your current setup doesn't look ideal. As Nikolas pointed out, optimize
should be run under exceptional circumstances, not for regular maintenance.
That's what the merge policy settings are for, and the right settings should
meet your needs, at least theoretically. That said, I can't say I've always
heeded this advice, since I've often resorted to using only_expunge_deletes
when things have gotten out of hand, because it's an easy remedy to a large
I'm trying out a different set of settings to those Nikolas just pointed
out. Since my issue is OOMs when merges take place, not so much I/O, I
figured the issue is with one of two things:
- Too many segments are being merged concurrently.
- The merged segments are too large.
I reduced "max_merge_at_once", but this didn't fix the issue. So it had
to be that the segments being merged were quite large. I noticed that my
largest segments often formed >50% of each shard and had up to 30% deletes,
and OOMs occurred when these massive segments were being "merged" to
expunge deletes, since it led to the amount of data on the shard almost
To remedy this, I've REDUCED the size of "max_merged_segment" (I can live
with more segments) and reindexed all of my data (since this doesn't help
reduce existing large segments). If I understand merge settings correctly,
this means that in the worst case scenario, the amount of memory used for
merging will be (max_merged_segment x max_merge_at_once) GB.
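As a worked example of that estimate, here is a tiny Python sketch. It only encodes the rough bound stated above, which is my own reading of the settings and not a documented guarantee; the value of 10 for max_merge_at_once is used purely for illustration.

```python
def worst_case_merge_gb(max_merged_segment_gb: float, max_merge_at_once: int) -> float:
    """Rough upper bound from the post: segment size cap times merge width."""
    return max_merged_segment_gb * max_merge_at_once

# Illustrative values only: a 2 GB cap versus a 5 GB cap, 10 segments per merge.
small_cap = worst_case_merge_gb(2, 10)
large_cap = worst_case_merge_gb(5, 10)
```

With these numbers the estimate goes from 20 GB to 50 GB, which is why shrinking the cap looked attractive for an OOM problem.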
Since these settings don't apply retrospectively to existing large
segments, I've reindexed all of my data. All of this was done in the last
day or so, so I've yet to see how it works out, though I'm optimistic.
By the way, I believe "max_merged_segment" limits are not observed for
explicit optimize, so at least in my setup, I'm going to have to shy away
from explicitly expunging deletes. It could be that in your case, because
of repeated explicit optimizes, or use of max_num_segments, coupled with
the fact that you have a lot of reindexing going on (that too with child
documents, since any change in any one of the child documents results in
all other child documents and the parent document being marked as deleted),
things have gotten particularly out of hand.
On 3 December 2014 at 06:29, Nikolas Everett nik...@gmail.com wrote:
On Wed, Dec 3, 2014 at 8:32 AM, Jonathan Foy the...@gmail.com wrote:
Interesting...does the very large max_merged_segment not result in
memory issues when the largest segments are merged? When I run the
cleanup command (_optimize?only_expunge_deletes) I see a steep spike in
memory as each merge is completing, followed by an immediate drop,
presumably as the new segment is fully initialized and then the old ones
are subsequently dropped. I'd be worried that I'd run out of memory when
initializing the larger segments. That being said, I only notice the large
spikes when merging via the explicit optimize/only_expunge_deletes command,
the continuous merging throughout the day results in very mild spikes by
I don't see memory issues but I'm not really looking for them. Memory
usage has never been a problem for us. IO spikes were a problem the few
times I ran only_expunge_deletes.
I'm forming the opinion that calling _optimize should be a pretty
remarkable thing. Like it should only be required when:
1. You are done writing an index and will never touch it again and want
to save some space/make querying a bit faster.
2. You are working around some funky bug.
3. You've just built the index with funky merge settings that created a
bazillion segments but imported quickly.
4. You shouldn't be calling it. Stop now. You've made a mistake.
I think that #1 and #3 aren't valid for only_expunge_deletes though. So
that leaves either - you are working around a bug or you are making a
In your case I think your mistake is taking the default merge
settings. Maybe. Or maybe that is a bug. I'm not sure. If it is a
mistake then you are in good company.
Also! only_expunge_deletes is kind of a trappy name - what it really
does is smash all the segments with deletes together into one big segment
making the max_merged_segment worse in the long run.
A steep spike in memory usage is probably not worth worrying about so
long as you don't see any full GCs done via stop the world (concurrent mode
failure). I'd expect to see more minor GCs during the spike and those are
stop the world but they should be pretty short. Elasticsearch should log
a WARNING or ERROR during concurrent mode failures. It also exposes
counters of all the time spent in minor and full GCs and you can jam those
into RRDtool to get some nice graphs. Marvel will probably do that for
you, I'm not sure. You can also use jstat -gcutil <pid> 1s 10000 to get
it to spit out the numbers in real time.
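If you capture that jstat output to a file, something like this Python sketch could tell you whether the full GC counter moved during a merge spike. It reads columns by header name, so it assumes only that the output has FGC and FGCT columns; the function names are hypothetical. Note that a rising FGC count is only a hint: it does not by itself distinguish a concurrent mode failure from other full collections.

```python
from typing import Dict, List, Optional, Tuple

def _to_number(token: str) -> Optional[float]:
    """jstat prints '-' for columns it cannot report; map those to None."""
    try:
        return float(token)
    except ValueError:
        return None

def parse_gcutil(text: str) -> List[Dict[str, Optional[float]]]:
    """Parse captured `jstat -gcutil` output into one dict per sample line."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (_to_number(tok) for tok in row))) for row in rows]

def full_gc_delta(samples: List[Dict[str, Optional[float]]]) -> Tuple[int, float]:
    """Full GC events and seconds spent in them between first and last sample."""
    first, last = samples[0], samples[-1]
    return int(last["FGC"] - first["FGC"]), last["FGCT"] - first["FGCT"]
```

Feeding it the lines captured around an only_expunge_deletes run and looking at full_gc_delta() gives a quick answer to whether the spike cost any full collections.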
I guess I could always add a single node with the higher settings and
just drop it if it becomes problematic in order to test (since, though
dynamic, prior to 1.4 the merge settings only take effect on shard
initialization if I remember correctly).
I'm pretty sure that is an index-level setting. Also, I think there
was an issue with applying it live in some versions but I know it's fixed in
1.4. I'm pretty sure you can trick your way around the issue by moving the
shard to another node. It's kind of fun.
Thanks for the advice though, I'll definitely try that.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/095957e9-5fa5-43f5-824e-fe0c65b2640a%40googlegroups.com.