We're seeing some issues with timeouts lasting 6 seconds in
ElasticSearch during what appear to be massive garbage collection
sessions. We have 4 nodes and 7 shards, and the index is only 4.3 GB
loaded into memory. We've allocated 10 GB of memory to ES on each
server and have followed the recommendations to keep ES memory from
being swapped out.
These are CentOS 5 boxes, and I've set /proc/sys/vm/swappiness to 0.
On one box I have completely disabled swap, but as a system
administrator that makes me really nervous. Disabling swap entirely
is not a production solution, and it doesn't appear to have fixed the
problem anyway.
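For reference, this is how I set it (the sysctl.conf line is only there so it survives a reboot):

    # on each node (CentOS 5)
    sysctl -w vm.swappiness=0
    echo "vm.swappiness = 0" >> /etc/sysctl.conf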
I hacked together a Perl script to pull data from the ElasticSearch
nodes and output it to Graphite or Cacti; it's attached.
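The script is just a thin wrapper around the node stats API. A stripped-down sketch of the idea (not the actual attachment; the /_nodes/stats/jvm endpoint and field names are what I see on my nodes and may differ between ES versions) looks roughly like this:

    #!/usr/bin/perl
    # Poll the ES node stats API and print heap/GC metrics as Graphite
    # plaintext lines ("path value timestamp"), one per metric.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use JSON qw(decode_json);

    my $host = shift @ARGV || 'localhost:9200';    # any node's HTTP port
    my $body = get("http://$host/_nodes/stats/jvm")
        or die "no response from $host\n";
    my $stats = decode_json($body);
    my $now   = time;

    for my $id ( keys %{ $stats->{nodes} } ) {
        my $node = $stats->{nodes}{$id};
        (my $name = $node->{name}) =~ s/[^\w-]/_/g;   # keep metric paths Graphite-safe
        my $jvm = $node->{jvm};

        printf "es.%s.jvm.heap.used_bytes %d %d\n",
            $name, $jvm->{mem}{heap_used_in_bytes}, $now;

        for my $gc ( keys %{ $jvm->{gc}{collectors} } ) {
            printf "es.%s.jvm.gc.%s.time_ms %d %d\n",
                $name, $gc, $jvm->{gc}{collectors}{$gc}{collection_time_in_millis}, $now;
        }
    }

I run it from cron and pipe the output to the Graphite line receiver on port 2003 (or adapt the prints for Cacti).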
I'm seeing some interesting garbage collection behavior at the point
this timeout occurs. Graphlot-2h.png shows the garbage collection
time_ms spiking as heap.used_bytes drops significantly. This seems to
be a recurring pattern; see Graphlot-24h.png.
It seems that once a node has more than 8 GB in heap.used_bytes, it
garbage-collects itself down to 2 GB. Between those points, though,
the GC seems relatively unconcerned by the expanding heap.
Is there a way to make the garbage collector favor smaller, more
frequent collections rather than doing it all at once every 2-3
hours? These big collections result in nodes timing out their
connections for 6 seconds every 2-3 hours.
I'm not a Java guy, so any nudges in the right direction would be appreciated.
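For example, is tuning something like the following (via ES_JAVA_OPTS) the right direction? I'm guessing at these flags based on what I've read about the CMS collector, which I believe is what ES is using here, so corrections are very welcome:

    # Guesswork: ask CMS to start collecting the old generation when it is
    # ~70% full, instead of waiting until the heap has ballooned past 8 GB
    -XX:+UseConcMarkSweepGC
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:CMSInitiatingOccupancyFraction=70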