We're seeing issues with timeouts lasting 6 seconds in Elasticsearch during what appear to be massive garbage collection sessions. We have 4 nodes and 7 shards, and the index is only 4.3 GB loaded into memory. We've allocated 10 GB of memory to ES on each server and have followed the recommendations:
elasticsearch.yml:
bootstrap.mlockall: true
ENV:
ES_MIN_MEM=10g
ES_MAX_MEM=10g
These are CentOS 5 boxes, and I've set /proc/sys/vm/swappiness to 0.
On one box I have completely disabled swap, but as a system administrator this makes me really nervous. Disabling swap is not a production solution, and it doesn't appear to have fixed the issue anyway.
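For reference, I made the swappiness change with the standard sysctl knobs (this is stock CentOS 5; adjust paths for your distribution):
sysctl -w vm.swappiness=0
echo 'vm.swappiness = 0' >> /etc/sysctl.conf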
I hacked together a Perl script to pull data from the Elasticsearch nodes and output it to Graphite or Cacti. It's attached as perf_elastic_search.pl.
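To give an idea of what it does, here is a stripped-down sketch (this is not the attached script; the stats endpoint and the JSON field names are from memory of the 0.x node stats API, so double-check them against your version):

#!/usr/bin/env perl
# Minimal sketch: poll one node's stats API and print Graphite-style lines.
# The endpoint (/_cluster/nodes/stats?jvm=true) and the field names are
# assumptions based on the 0.x API; verify against your Elasticsearch version.
use strict;
use warnings;
use LWP::UserAgent;
use JSON;

my $host = shift || 'localhost';
my $ua   = LWP::UserAgent->new( timeout => 5 );
my $res  = $ua->get("http://$host:9200/_cluster/nodes/stats?jvm=true");
die "stats request failed: " . $res->status_line unless $res->is_success;

my $stats = decode_json( $res->decoded_content );
my $now   = time;

for my $id ( keys %{ $stats->{nodes} } ) {
    my $node = $stats->{nodes}{$id};
    my $jvm  = $node->{jvm} or next;
    ( my $name = $node->{name} || $id ) =~ s/\W+/_/g;    # node names can contain spaces
    # Metric names mirror what shows up in the graphs (heap.used_bytes, time_ms).
    printf "es.%s.jvm.heap.used_bytes %s %d\n",
        $name, $jvm->{mem}{heap_used_in_bytes} || 0, $now;
    printf "es.%s.jvm.gc.time_ms %s %d\n",
        $name, $jvm->{gc}{collection_time_in_millis} || 0, $now;
}

The output goes straight into Graphite's plaintext listener, or can be adapted for Cacti.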
I'm seeing some interesting garbage collection behavior at the point this timeout occurs. The attached Graphlot-2h.png shows the garbage collection time_ms spiking as heap.used_bytes drops significantly. This seems to be a pattern; see Graphlot-24h.png.
It seems that once a node has more than 8 GB in heap.used_bytes, it garbage collects itself down to 2 GB. The GC between these points, however, barely touches the expanding heap.
Is there a way to make the garbage collector favor smaller, more frequent collections rather than doing everything at once every 2-3 hours? This garbage collection results in nodes timing out their connections for 6 seconds every 2-3 hours.
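From some reading, the knobs that look relevant are the CMS occupancy settings, i.e. starting the concurrent collections earlier so the heap never creeps up to the 8 GB point where the big pause happens. Something like the following in the environment the init script sources; the flag names are standard HotSpot options, but I haven't tried them, and I'm not certain ES_JAVA_OPTS is the right place for them:
ENV:
ES_JAVA_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"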
I'm not a Java guy, so any nudges in the right direction would be appreciated.
We're seeing issues with timeouts lasting 6 seconds in Elasticsearch during what appear to be massive garbage collection sessions. We have 4 nodes and 7 shards, and the index is only 4.3 GB loaded into memory. We've allocated 10 GB of memory to ES on each server and have followed the recommendations:
elasticsearch.yml:
bootstrap.mlockall: true
ENV:
ES_MIN_MEM=10g
ES_MAX_MEM=10g
One question - are you sure that you have "ulimit -l unlimited" set for
the user running elasticsearch? If not, then the mlockall will be
ignored.
On Fri, 2012-02-17 at 11:41 +0100, Brad Lhotsky wrote:
Sorry, I forgot to include that. Yes, I have this in /etc/security/limits.conf:
elasticsearch - memlock unlimited
Sorry, just to repeat: are you sure that it is being applied? When you
start ES, does it immediately take up the full amount specified in
ES_MAX_MEM?
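A quick way to check what the elasticsearch user actually gets is something like:
su - elasticsearch -c 'ulimit -l'
which should report "unlimited". Keep in mind that limits.conf is applied via PAM, so a daemon started from an init script may not pick those limits up at all; in that case you'd need to set the ulimit in the init script itself.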
Have you graphed the JVM stats? Do you see the same massive GC sessions every few hours?
I haven't done so for a long time, but yes, I used to see those big GCs; they were fast enough that they didn't cause a problem. We're currently running 0.17.9 in production, so if you're on a more recent version, behaviour might have changed.
I'll use bigdesk to capture some data today and come back to you.
First, there is no problem with this size of heap (as mentioned by someone on this thread); people are running ES with 30 GB heaps happily. You mentioned you store the index in memory? What's the config that you use?
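For example, a fully in-memory store would be something along the lines of the following in elasticsearch.yml (I'm guessing here, so please paste your actual settings):
elasticsearch.yml:
index.store.type: memory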
On Friday, February 17, 2012 at 12:20 PM, Brad Lhotsky wrote:
Hello,
We're seeing issues with timeouts lasting 6 seconds in Elasticsearch during what appear to be massive garbage collection sessions. We have 4 nodes and 7 shards, and the index is only 4.3 GB loaded into memory. We've allocated 10 GB of memory to ES on each server and have followed the recommendations:
elasticsearch.yml:
bootstrap.mlockall: true
ENV:
ES_MIN_MEM=10g
ES_MAX_MEM=10g
These are CentOS 5 boxes, and I've set /proc/sys/vm/swappiness to 0.
On one box I have completely disabled swap, but as a system administrator this makes me really nervous. Disabling swap is not a production solution, and it doesn't appear to have fixed the issue anyway.
I hacked together a Perl script to pull data from the Elasticsearch nodes and output it to Graphite or Cacti. It's attached as perf_elastic_search.pl.
I'm seeing some interesting garbage collection behavior at the point this timeout occurs. The attached Graphlot-2h.png shows the garbage collection time_ms spiking as heap.used_bytes drops significantly. This seems to be a pattern; see Graphlot-24h.png.
It seems that once a node has more than 8 GB in heap.used_bytes, it garbage collects itself down to 2 GB. The GC between these points, however, barely touches the expanding heap.
Is there a way to make the garbage collector favor smaller, more frequent collections rather than doing everything at once every 2-3 hours? This garbage collection results in nodes timing out their connections for 6 seconds every 2-3 hours.
I'm not a Java guy, so any nudges in the right direction would be appreciated.