Heap Allocation for Machines with 100GB+ RAM

I've read a lot of different guides for how much memory to allocate to
Elasticsearch, and everything points to 50% of available system memory
(with the rest left over for caching and whatnot). A post to the group in
February
(https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/wo5jU0-PQ3k)
mentions that under 31GB, the JVM can compress 64-bit pointers. But I'm
working on machines that have 148GB of memory. Is there any official
guidance on how the heap should be allocated on machines like this? Should I
just start multiple nodes with <31GB heap sizes to benefit from
pointer compression? Or should I still allocate 50% of system memory?

Also, in the comments on this blog post
(http://blog.sematext.com/2012/02/07/elasticsearch-poll/), kimchy mentions
that the reason for running multiple instances would be smaller heaps.
Since that's from a year ago, is that still the case?

On a somewhat related note, was the same_shard.host
cluster.routing.allocation setting removed? I can't seem to set it via
/_cluster/settings. I guess I could always just set an awareness rule
using a node parameter instead ...

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The reason behind the 50:50 split between JVM and OS is to leave the OS
enough room for efficient I/O. ES is fast because of OS file caching,
not just because of large heaps. There is no strict rule for 50:50; it
depends on the data you have to load on the machine.
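Putting the 50:50 rule together with the compressed-pointer ceiling from the question, a back-of-the-envelope calculation for one of the 148GB machines might look like this (bash arithmetic; the 31GB cap is the commonly cited safe value below the ~32GB compressed-oops threshold, not an exact JVM constant):

```shell
# Sketch: pick a heap for a 148GB machine.
# Rule of thumb from this thread: half of RAM, but stay under the
# ~32GB compressed-oops threshold (31GB is a safe ceiling).
TOTAL_MB=151552              # 148 GB, expressed in MB
HALF_MB=$((TOTAL_MB / 2))    # the 50% guideline
CAP_MB=$((31 * 1024))        # compressed-oops ceiling
HEAP_MB=$(( HALF_MB < CAP_MB ? HALF_MB : CAP_MB ))
echo "ES_HEAP_SIZE=${HEAP_MB}m"
```

On a machine this large the cap wins, which is exactly why the question of "multiple <31GB nodes vs. one 50% heap" comes up at all.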

So it would help to know what data you want to process in 148G with ES,
and what the workload looks like. Index size? Indexing throughput?
Queries? Filters/Facets?

Generally speaking, if you don't have a clue how to size ES, just scale
your system from small to large. Start with one node on one machine and
configure a heap that is reasonable for your expected workload.
Take the numbers over a significant period of time (a few hours or days).
Then increase in steps, maybe 4G, 8G, 16G, and repeat until
your measurements show that your ES node gets slower with your data.
Watch out for GC.
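One simple way to "watch out for GC" while stepping the heap up is to sample the running JVM with jstat, assuming a JDK is installed on the box (the pgrep pattern is a guess at how the ES process shows up in the process list; adjust it for your install):

```shell
# Sample GC utilization of the running ES JVM every 5 seconds.
# The FGC/FGCT columns count full GCs and the time spent in them;
# steadily rising FGCT as you grow the heap is the warning sign.
ES_PID=$(pgrep -f elasticsearch | head -n1)
jstat -gcutil "$ES_PID" 5000
```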

Advanced tuning like compressed OOPs is negligible here, especially with
148G RAM. Your primary challenge is to find the right JVM and the right
JVM settings. Running more than one JVM per machine will slow down
JVM-based applications on that machine.

Jörg


First, quickly: the cluster.routing.allocation.same_shard.host setting is still there; simply set it to true.
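For reference, a minimal sketch of setting it, assuming a cluster reachable on localhost:9200. Whether this particular setting is dynamically updatable via the cluster settings API can depend on the ES version, which may be why the /_cluster/settings attempt failed; the elasticsearch.yml route always works:

```shell
# Option 1: static setting in elasticsearch.yml on each node:
#   cluster.routing.allocation.same_shard.host: true

# Option 2: if your version accepts it dynamically (hypothetical
# localhost cluster), via the cluster update settings API:
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.same_shard.host": true
  }
}'
```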

Regarding memory, I assume you wish to run large indices on this machine, potentially with high memory requirements on ES itself (like using field data for sorting / faceting). If that's the case, then you are probably better off running around 30GB heaps with 2-3 instances. I have seen cases where larger single-instance heaps work well, but it really depends.
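A sketch of what 2-3 instances on one box could look like. The paths, ports, node names, and the -Des.* flag style are assumptions about a 0.90-era install, not prescribed values; the point is that each instance needs its own data directory and HTTP port:

```shell
# Hypothetical: one ES install shared by three instances, each with
# its own node name, data dir, HTTP port, and a ~30GB heap.
for i in 1 2 3; do
  ES_HEAP_SIZE=30g bin/elasticsearch \
    -Des.node.name="node-$i" \
    -Des.path.data="/var/data/es$i" \
    -Des.http.port=$((9199 + i)) &
done
```

With several instances per host, the same_shard.host setting mentioned above matters, since it keeps copies of the same shard from landing on two instances of one machine.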

If, on the other hand, you don't think or know that ES will need so much heap, then just test. If the indices are small as well, then you are probably wasting the machine.


This system is still pretty new, so I'm still getting a handle on what I can
do with it and how it's going to be used; I don't yet have a reference for
what load it can handle. But it's going to be used for analysis of logging
information, so it will receive a steady stream of new data and needs
to be able to search through a good amount of cumulative data. I think
sorting/faceting will become important eventually.

When you say large indices, does that mean only indices with a large number
of shards, or also a large number of indices with fewer shards? I thought it
was relatively equivalent to have one index with 100 shards or 100 indices
with one shard each. The reason I mention this is that these machines are
planned to have a very large number of relatively small indices (deleting
indices only when we start to run low on disk space, which should take
quite a while), and I wasn't sure if that fell into the category you mentioned.

I think the answer seems to be to do some testing or observation and see
how things develop. In that regard, are there any articles/guides that
cover properly testing an Elasticsearch cluster with regard to how it
handles load and how it uses the heap?


Murad,

A few quick thoughts:

  • Try G1 with a large heap, assuming you end up needing it, and the latest
    JVM from Oracle.
  • If you have issues with GC pauses, there is always Zing JVM. Ping me
    @sematext if you choose this route.
  • Use a monitoring tool that shows your heap usage and GC activity - it
    could be something as simple as jstat, jconsole, or VisualVM for
    ad-hoc observation, or something like SPM for more permanent,
    long-term observation and monitoring.
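For the first bullet, a minimal sketch of what enabling G1 could look like. The -XX flags are standard HotSpot options; whether your ES start script reads ES_JAVA_OPTS (and the 64g heap itself) are assumptions to adapt to your setup:

```shell
# Sketch: run ES on a recent Oracle JVM with G1 and a pause-time goal.
export ES_HEAP_SIZE=64g
export ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
bin/elasticsearch
```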

Otis

Solr & Elasticsearch Support - http://sematext.com/
Performance Monitoring - Sematext SPM
