Garbage collector issues

Hi,

We are running Elasticsearch 0.90.2 on Debian 7.0/OpenJDK 7u3 (2-node
cluster).
From time to time, Elasticsearch stops responding, and the issue looks
related to the garbage collector.

Here is the information we collected when the problems occur:
- The search threadpool hits the concurrent active items limit and the queue
limit (default values, i.e. 36 threads and 1000 slots in the queue).
- We have a high rate of slow queries (>8 seconds).
- The garbage collector logs long passes (around 6 seconds).
- Clients get Rejected exceptions.
All of this lasts for several minutes (> 10) from time to time.

Then everything gets back to normal.
Logs are attached.
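
For readers who want to measure how often such long passes occur, here is a minimal sketch that scans log text for GC durations above a threshold. The sample lines are hypothetical and only loosely modelled on Elasticsearch's `monitor.jvm` warnings; the attached logs are not reproduced here, so the regular expression may need adjusting to the real format.

```python
import re

# Hypothetical sample lines, loosely modelled on monitor.jvm log entries.
# The real log format may differ; adjust GC_LINE accordingly.
SAMPLE_LOG = """\
[2013-11-14 15:01:12,345][WARN ][monitor.jvm] [node1] [gc][ConcurrentMarkSweep][4021][12] duration [6.1s], collections [1]/[6.4s]
[2013-11-14 15:01:20,101][INFO ][monitor.jvm] [node1] [gc][ParNew][4022][981] duration [812ms], collections [1]/[1.2s]
[2013-11-14 15:01:31,770][WARN ][monitor.jvm] [node1] [gc][ConcurrentMarkSweep][4023][13] duration [5.8s], collections [1]/[6.0s]
"""

# Captures the collector name and the first "duration [...]" value on a line.
GC_LINE = re.compile(
    r"\[gc\]\[(?P<collector>[^\]]+)\].*?"
    r"duration \[(?P<value>[\d.]+)(?P<unit>ms|s|m)\]"
)

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def long_gc_pauses(log_text, threshold_seconds=5.0):
    """Return (collector, seconds) for every GC pass at or above the threshold."""
    pauses = []
    for line in log_text.splitlines():
        match = GC_LINE.search(line)
        if match is None:
            continue
        seconds = float(match.group("value")) * UNIT_SECONDS[match.group("unit")]
        if seconds >= threshold_seconds:
            pauses.append((match.group("collector"), seconds))
    return pauses


for collector, seconds in long_gc_pauses(SAMPLE_LOG):
    print(f"{collector}: {seconds:.1f}s")
```

The 5-second default threshold is an arbitrary choice for illustration, picked to sit just below the roughly 6-second passes described above.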

The values we have:
System total memory: 6GB
ES_HEAP_SIZE=3g

We are almost sure this issue comes from long GC runs.
We are planning to switch the GC to G1 (after upgrading to Java 7u25,
because this GC requires at least Java 7u4), but I've seen one thread in
this group saying it crashes :/
What can we do to prevent this behavior and run ES smoothly?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hello Vincent,

If the search threadpool hits the limit maybe you have too many concurrent
searches. If that's the case, you'll probably have to just add nodes and/or
increase the number of replicas. Or, you can look at making your queries
faster, if that's possible.
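
A rough back-of-the-envelope sketch of why slow queries saturate the pool, using only the figures from the original post (36 search threads, a 1000-slot queue, queries taking 8 seconds). It assumes every search occupies one thread for the full 8 seconds, which is a simplification, and it treats the limits as per node.

```python
def max_sustained_rate(threads, seconds_per_query):
    """Upper bound on searches per second when each search holds one thread."""
    return threads / seconds_per_query


def queue_drain_seconds(queue_slots, threads, seconds_per_query):
    """Time needed to work through a full queue at that rate."""
    return queue_slots / max_sustained_rate(threads, seconds_per_query)


rate = max_sustained_rate(threads=36, seconds_per_query=8.0)
drain = queue_drain_seconds(queue_slots=1000, threads=36, seconds_per_query=8.0)
print(f"{rate:.1f} searches/s, queue drains in about {drain:.0f} s")
# prints "4.5 searches/s, queue drains in about 222 s"
```

Under these assumptions, any incoming rate above about 4.5 searches per second fills the queue and then produces rejections, and a full queue takes minutes to clear. Faster queries raise the ceiling directly, which is why shortening query time or spreading load over more nodes both help.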

G1 may help, I would try it and see how it goes.

Last but not least, I would look at what is consuming memory. Is it the
field cache? Is it the filter cache? I think nodes stats can tell you that,
and you could turn a few knobs there to limit memory usage.
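
As an illustration of that check, here is a small sketch that takes a nodes-stats-like dictionary and reports how much of the used heap each cache accounts for. The field names (`fielddata`, `filter_cache`, `memory_size_in_bytes`, `heap_used_in_bytes`) and the sample values are assumptions made for this example; the actual response layout depends on the Elasticsearch version, so check it against your own output first.

```python
# Hypothetical excerpt shaped like a nodes stats response. Field names and
# numbers are assumptions for illustration, not taken from the thread.
SAMPLE_STATS = {
    "nodes": {
        "node-a": {
            "indices": {
                "fielddata": {"memory_size_in_bytes": 1_500_000_000},
                "filter_cache": {"memory_size_in_bytes": 300_000_000},
            },
            "jvm": {"mem": {"heap_used_in_bytes": 2_900_000_000}},
        }
    }
}


def cache_share_of_heap(stats):
    """Map node id -> {cache name: fraction of used heap held by that cache}."""
    report = {}
    for node_id, node in stats["nodes"].items():
        heap_used = node["jvm"]["mem"]["heap_used_in_bytes"]
        indices = node["indices"]
        report[node_id] = {
            name: indices[name]["memory_size_in_bytes"] / heap_used
            for name in ("fielddata", "filter_cache")
        }
    return report


for node_id, shares in cache_share_of_heap(SAMPLE_STATS).items():
    for name, share in shares.items():
        print(f"{node_id} {name}: {share:.0%} of used heap")
```

If one cache dominates the heap, that is the knob worth limiting first.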

You may also want to try out our SPM for Elasticsearch. It will show you
all sorts of metrics, from garbage collector and pool sizes to cache sizes.
I assume it would be very helpful in this particular case.

Best regards,
Radu

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

You probably want to switch to Oracle Java as well; OpenJDK is not
recommended.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

If you update to ES 0.90.7 or later, you should be safe using the G1
collector, because the GNU Trove collections have been replaced by HPPC.

I agree, moving from OpenJDK to the latest Oracle Java (7u25 is known to be
stable) can help in erratic memory situations.

But first, you have to check why you allocate so much data on the heap that
cannot be garbage collected. Maybe your requirements are simply too high
for just 2 nodes and you need to add more nodes.

Jörg

Hi guys,

Thank you for your help.
Following your advice, we're going to test ES 0.90.7+, Oracle JDK, and G1,
and look at what is consuming memory.
It will take a few days to set up the environment; I'll come back with
results when I have them.

About the load, our CPUs are idle most of the time. Would having more memory
help to get better behavior?

About Oracle JDK vs. OpenJDK, is there a real difference in how the two
behave with ES? OpenJDK ships with Debian, while Oracle JDK does not. We
like the idea of simply running apt-get upgrade to get the latest patches
(and Java is patched very often).

Vincent

It depends on the OpenJDK release version. You have mentioned OpenJDK 7u3,
and this is an older version, maybe with open bugs affecting Lucene/ES.

OpenJDK forms the base for Oracle JDK
http://openjdk.java.net/projects/jdk7u/qanda.html

Vendors and distributors may patch OpenJDK for their purposes, and they
also recommend Java versions. It is up to you to get informed about the
best solution for you.

I always recommend updating to the latest Java 7 version, because that
gives the best chance of having most known bugs fixed.

Please note that every once in a while, new Java releases bring new
challenges to running Lucene/ES smoothly. Unfortunately, there is no
"official" certification process for finding a reliable JVM for Lucene/ES to
mitigate risks; only best-practice advice is available. In general, all JVMs
since version 6 should be able to run ES "somehow" (i.e. without crashing).

Jörg

Hi,

Some feedback for the community.

We have upgraded to ES 0.90.7+, same GC problems.

We have upgraded OpenJDK 7 from update 3 to update 25:

  1. We have a 2-node cluster, and running u3 alongside u25 gave
    serialization errors, so we needed to upgrade both hosts at the same time.
  2. We got strange results with u25: CPU usage was high, and ES regularly
    stopped responding because of it.
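
Since mixing JVM updates across nodes caused the serialization errors in point 1, a quick consistency check before and after an upgrade can be useful. The sketch below groups nodes by reported JVM version from a nodes-info-like dictionary; the `jvm.version` field layout and the sample node names are assumptions for illustration and should be checked against the response of your own cluster.

```python
# Hypothetical excerpt shaped like a nodes info response. The layout is an
# assumption for illustration and may differ between Elasticsearch versions.
SAMPLE_INFO = {
    "nodes": {
        "node-a": {"jvm": {"version": "1.7.0_03"}},
        "node-b": {"jvm": {"version": "1.7.0_25"}},
    }
}


def jvm_versions(info):
    """Group node ids by the JVM version each node reports."""
    versions = {}
    for node_id, node in info["nodes"].items():
        versions.setdefault(node["jvm"]["version"], []).append(node_id)
    return versions


def is_mixed(info):
    """True when the cluster runs more than one JVM version."""
    return len(jvm_versions(info)) > 1


if is_mixed(SAMPLE_INFO):
    for version, nodes in sorted(jvm_versions(SAMPLE_INFO).items()):
        print(f"{version}: {', '.join(nodes)}")
```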

We have upgraded to Oracle JDK 7 update 45.
No more CPU issues and, so far, no more GC issues, without any further
tuning. We are still watching whether the GC behaves correctly, but
behaviour looks much better.

Vincent
