We are running Elasticsearch 0.90.2 on Debian 7.0/OpenJDK 7u3 (2-node
cluster).
From time to time, Elasticsearch stops responding, and the issue looks
related to the Garbage Collector.
Here is the information we have collected when the problem occurs:
- The search threadpool hits the concurrent active items limit and the queue
limit (default values, i.e. 36 threads and 1000 slots in the queue).
- We have a high rate of slow queries (>8 seconds).
- The garbage collector logs long passes (around 6 seconds).
- Clients get rejected exceptions.
All of this happens for several minutes (>10) from time to time, then
everything gets back to normal.
Logs are attached.
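(For reference, the thread pool and GC figures can be watched live with the
nodes stats API while the problem is happening; a rough sketch only, assuming
the default HTTP port 9200 and the 0.90.x flag names:)

  # Sketch: pull thread pool and JVM/GC stats from the nodes.
  curl -s 'http://localhost:9200/_nodes/stats?thread_pool=true&jvm=true&pretty=true'
  # The interesting parts are thread_pool.search.queue/rejected and the
  # jvm.gc collector counts and times.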
The values we have:
System total memory: 6GB
ES_HEAP_SIZE=3g
We are almost sure this issue comes from long GC runs.
We are planning to switch the GC to G1 (after upgrading to Java 7u25,
because this GC requires at least Java 7u4), but I've seen one thread in this
group saying it crashes :/
What can we do to prevent this behavior and run ES smoothly?
If the search threadpool hits the limit, maybe you have too many concurrent
searches. If that's the case, you'll probably have to add nodes and/or
increase the number of replicas. Or you can look at making your queries
faster, if that's possible.
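For example, adding a replica is just a dynamic settings update, and the
search slow log can help you see which queries need work. A sketch only:
'myindex' and the threshold are placeholders, and an extra replica needs
enough nodes to be assigned to.

  # Sketch: add a replica to an index ('myindex' is a placeholder).
  curl -XPUT 'http://localhost:9200/myindex/_settings' -d '
  { "index" : { "number_of_replicas" : 2 } }'

  # Sketch: log queries slower than 5s so they can be identified and tuned.
  curl -XPUT 'http://localhost:9200/myindex/_settings' -d '
  { "index.search.slowlog.threshold.query.warn" : "5s" }'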
G1 may help; I would try it and see how it goes.
Last but not least, I would look at what is consuming memory. Is it the field
cache? Is it the filter cache? I think nodes stats can tell you that, and you
could turn a few knobs there to limit memory usage.
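Something along these lines, for instance (setting names as I remember them
on 0.90.x, so double-check them against the documentation):

  # Sketch: see how much memory the field data and filter caches hold per node.
  curl -s 'http://localhost:9200/_nodes/stats?indices=true&pretty=true'

  # Possible knobs in elasticsearch.yml to cap those caches (0.90.x names,
  # please verify before relying on them):
  #   indices.fielddata.cache.size: 30%
  #   indices.cache.filter.size: 10%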
You may also want to try out our SPM for Elasticsearch. It will show you
all sorts of metrics, from Garbage Collector and pool sizes to cache sizes.
I assume it would be very helpful in this particular case.
If you update to ES 0.90.7 or later, you should be safe using the G1
collector, because the GNU Trove collections have been replaced by HPPC.
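If you want to try it, something like this should do (a sketch only: it
assumes your startup script honours ES_JAVA_OPTS, and you would also have to
remove the default CMS flags from bin/elasticsearch.in.sh so they don't
conflict with G1):

  # Sketch: start ES with G1 instead of the default CMS collector.
  # Assumes ES_JAVA_OPTS ends up appended to JAVA_OPTS by the scripts, and
  # that -XX:+UseConcMarkSweepGC and friends were removed from
  # elasticsearch.in.sh beforehand.
  export ES_HEAP_SIZE=3g
  export ES_JAVA_OPTS="-XX:+UseG1GC"
  bin/elasticsearch -f   # -f keeps 0.90.x in the foreground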
I agree, moving from OpenJDK to the latest Oracle Java (7u25 is known to be
stable) can help in erratic memory situations.
But first, you should check why you are allocating so much data on the heap
that it cannot be garbage collected. Maybe your requirements are simply too
high for just 2 nodes and you need to add more.
Thank you for your help.
Following your advice, we're going to test ES 0.90.7+, the Oracle JDK, and
G1, and look at what is consuming memory.
It will take a few days to set up the environment; I'll come back with
results when I have them.
About the load: our CPUs are idle most of the time. Would having more memory
give us better behavior?
About Oracle vs. OpenJDK: is there a real difference in behaviour between the
two with ES? OpenJDK ships built into Debian, while the Oracle JDK doesn't.
We like the idea of simply running apt-get upgrade to get the latest patches
(and Java is patched very often).
Vincent
Vendors and distributors may patch OpenJDK for their own purposes, and they
also recommend specific Java versions. It is up to you to get informed about
the best solution for you.
I always recommend updating to the latest Java 7 version, for the best chance
of having most known bugs fixed.
Please note that every once in a while, new Java releases bring new
challenges to running Lucene/ES smoothly. There is unfortunately no
"official" certification process for finding a reliable JVM for Lucene/ES to
mitigate risks; only best-practice advice is available. In general, all JVMs
since version 6 should be able to run ES "somehow" (i.e. without crashing).
We have upgraded OpenJDK 7 from update 3 to update 25:
- We have a 2-node cluster, and running u3 alongside u25 gives serialization
errors, so we needed to upgrade both hosts at the same time.
- We got strange results with u25: high CPU usage, and ES regularly stopped
responding because of it.
We have since upgraded to Oracle JDK 7 update 45.
No more CPU issues, no problems for now, and no more GC issues even without
further tuning. We are still watching whether the GC behaves correctly, but
the behaviour looks much better.