Optimize api working inconsistently...a bug

We are in the final throes of readying an Elastic cluster for a production
search application. After much performance work we ended up with eight VM
instances each with 54gb of memory and ten cores. 19.5 gb of this is
allocated for the elastic jvm heap. The index is about 50 million items. We
have very complex, faceted, queries and require sub-second response time.

The lucene-only system being replaced attained 250 mS response but that was
on solid-state discs and read-only access (the index is wholly replaced and
never updated).

The Elastic system with eight shards and two replicas can attain an average
response of 240 mS at 1200 requests per minute but only if we optimize to
one segment after indexing:

curl -XPOST "

http://srch-lv109:9200/products-20121126-140607/_optimize?max_num_segments=1
"
{"ok":true,"_shards":{"total":24,"successful":24,"failed":0}}

However (here is the bug) inspecting the index directories shows that the
call actually affected either none of the shards or only one or two on each
machine. And performance does not improve.

Issuing another optimize request usually DOES work. Performance improves.

But we cannot issues two requests each lasting 30 minutes in production due
to the overhead of this command and the fact that it prevents indexing (I
presume) while it is running.

Can anyone help? Is this a known issue? W'ere on version 0.20.0 RC1.

Thanks,
Randall McRee
r a n d a ll dot m c r e e at gmail dot com
SHOP MA

--

Hi Randall,

only same vague hints, since I'm unaware of the type of performance
measurements you have already taken.

From what it seems, you are running ES under the default segment merge
settings? Long optimize runs may happen because the Lucene indices segments
have grown too big.
Check http://www.elasticsearch.org/guide/reference/index-modules/merge.html
if you can get better results by reducing the value
in index.merge.policy.segments_per_tier
and/or index.merge.policy.max_merged_segment

Another factor, in my opinion, is the very large heap you configured. Have
you double checked - with the help of the bigdesk tool - how much of the
19.5 GB is used during optimize, and how many GC are encountered? When
optimizing, a lot of byte arrays are moved over the heap, so chances are
high that huge compacting GC runs occur in the meantime, due to certain
limitations in the concurrent mark/sweep garbage collecting algorithm.
Sometimes, the compacting garbage collection of several gigabytes can take
unpredictable time which is unfortunate. I wouldn't be too surprised if the
"optimize" duration of 30 minutes you mentioned are partly caused by a few
long GC runs.

You would have some options if you come to the conclusion that your memory
settings are not optimal: adjusting heap size, adjust the JVM parameters
for GC, setting mlockall() in case of paging(swapping).

Another trouble, I assume you have already checked, may come from the disk
write performance, such large segments are in high demand for high I/O
throughput.

And, just for the sake of completeness, check if your system is paging and
swaps part of the ES JVM. I doubt it, but, in spite of 54GB RAM, this can
happen. This will trash your overall ES performance.

Best regards,

Jörg

--

Hi Jorg,
Thanks for your reply. We've thoroughly "explored" the space of smaller jvm
heaps, less memory and so on. We have looked at your advice elsewhere in
this newsgroup and on your site. Apparently, we need so much heap because
of our extensive use of facets and the fact that our queries are very
complex. In early testing we used simple queries and this caused us to
vastly underestimate the amount of memory elastic search would need. We
started the performance exercise with twelve servers each with four cores
and 24gb of memory.

During load and latency testing I monitor BigDesk to try and "understand"
what is going on. GC has not been a big factor. Although it occurs often
each instantiation is short. Swapping does not happen.

If I use less than 19gb of heap I eventually see an OOM due to loading of
the "Price" facet. I guess this facet, being a float range takes the most
space?

All configurations I have tested that had less than 32gb of memory also
tended to have nodes "drop out" under load testing due to missed
heartbeats. We also set the jvm wrapper ping to 600mS since that could also
inadvertently kill a server.

We do need to experiment with the segment settings so that, perhaps, going
to 1 segment is quicker.

I have quiesced the cluster during optimize and it seems to do very little.
No GC. No swap. Just transport and read/write bytes activity.

I have also varied the number of shards when trying all of the various
configurations. First, I notice, that good throughput requires a balanced
configuration. One node which has more of the index than another will be a
bottleneck. More shards seems a bit more flexible in that you have smaller
pieces to deal out. In general, overall performance seems gated by what
proportion of the total index ends up on each server. Although this argues
for more servers we found it better to have fewer servers with more memory
(the total number of memory is fixed by the VM hosts). Whatever proportion
of the index you have you then need a certain amount of memory for the file
system cache in order to keep IOPs in a reasonable range, as well as
needing a proportionally sized field cache.

This morning, for example, we went from 54gb to 63gb per server. Other
factors stayed the same. Latency is still in the 250 to 300 millisecond
range--unchanged. But the cluster went from being able to handle a maximum
of 120 requests per second (1080mS latency) to 150 requests per second
(780 mS latency). Probably, this is also due to having a larger file system
cache. Each server has one-quarter of the index in either case. Another
configuration that we find works well is to have three-eighths of the index
on each server.

Thanks,
Randy

On Fri, Nov 30, 2012 at 12:20 PM, Jörg Prante joergprante@gmail.com wrote:

Hi Randall,

only same vague hints, since I'm unaware of the type of performance
measurements you have already taken.

From what it seems, you are running ES under the default segment merge
settings? Long optimize runs may happen because the Lucene indices segments
have grown too big. Check
http://www.elasticsearch.org/guide/reference/index-modules/merge.html if
you can get better results by reducing the value
in index.merge.policy.segments_per_tier
and/or index.merge.policy.max_merged_segment

Another factor, in my opinion, is the very large heap you configured. Have
you double checked - with the help of the bigdesk tool - how much of the
19.5 GB is used during optimize, and how many GC are encountered? When
optimizing, a lot of byte arrays are moved over the heap, so chances are
high that huge compacting GC runs occur in the meantime, due to certain
limitations in the concurrent mark/sweep garbage collecting algorithm.
Sometimes, the compacting garbage collection of several gigabytes can take
unpredictable time which is unfortunate. I wouldn't be too surprised if the
"optimize" duration of 30 minutes you mentioned are partly caused by a few
long GC runs.

You would have some options if you come to the conclusion that your memory
settings are not optimal: adjusting heap size, adjust the JVM parameters
for GC, setting mlockall() in case of paging(swapping).

Another trouble, I assume you have already checked, may come from the disk
write performance, such large segments are in high demand for high I/O
throughput.

And, just for the sake of completeness, check if your system is paging and
swaps part of the ES JVM. I doubt it, but, in spite of 54GB RAM, this can
happen. This will trash your overall ES performance.

Best regards,

Jörg

--

--