90.5/7 OOM errors-- memory leak or GC problems?


(kakaner) #1

Hi all!

I set up an ES cluster a couple weeks ago dedicated to a specific search
and document pattern and have been experiencing problems with it since then.

Every 18-24 hours we need to restart our cluster because we run out of
heap. Either there's a memory leak or problems with GC. Here is an image of
the sample memory usage:

https://lh6.googleusercontent.com/-CniK9Tc1J5I/UqDRPlRf5HI/AAAAAAAAAJk/lFK5nYdMo9Q/s1600/Screen+Shot+2013-11-30+at+2.47.51+AM.png
Note: drops to 0 are cluster restarts

We deployed with JDK 1.7.u25 and v0.90.5. Relevant stats:

  • 4 nodes (AWS 2xlarge), 1 replica
  • 16G reserved heap
  • 15 shards per index, 25 indexes, only 11M docs, relatively uniformly
    distributed over indexes ( I know the allocation is overkill right now but
    we're preparing for a huge influx of data)
  • 200-500 searches/s
  • mlockall = true
  • Using the Java API in Scala
  • JVM options (from the service wrapper config):
    wrapper.java.additional.1=-Delasticsearch-service
    wrapper.java.additional.2=-Des.path.home=%ES_HOME%
    wrapper.java.additional.3=-Xss256k
    wrapper.java.additional.4=-XX:+UseParNewGC
    wrapper.java.additional.5=-XX:+UseConcMarkSweepGC
    wrapper.java.additional.6=-XX:CMSInitiatingOccupancyFraction=75
    wrapper.java.additional.7=-XX:+UseCMSInitiatingOccupancyOnly
    wrapper.java.additional.8=-XX:+HeapDumpOnOutOfMemoryError
    wrapper.java.additional.9=-Djava.awt.headless=true

Things we then tried:

  1. Per this post http://jontai.me/blog/2013/06/esrejectedexecutionexception-rejected-execution-of-messagechannelhandler-requesthandler/ I
    updated to a fixed thread pool with unbounded queues. However, I understand
    this wasn't necessary for 0.90.5? Nothing changed.
  2. Changed heap to 8G. Got worse.
  3. Downgraded JDK to 1.6u41 since it was working on another box. Nothing
    changed.
  4. Finally upgraded to 0.90.7 and JDK 1.7u45 per this use case https://groups.google.com/forum/#!searchin/elasticsearch/jvm$20heap/elasticsearch/tAZIC_ffAiU/n3wPpMu6FzgJ.
    Slightly better, now the graphs look like this (we can last 2-3 days
    without a restart):

https://lh4.googleusercontent.com/-1qv5RGtJZwU/UqDRWXee0xI/AAAAAAAAAJs/uIz22Fk_tKM/s1600/Screen+Shot+2013-12-05+at+12.09.32+PM.png

We have a QA setup that is not experiencing problems:

  • Identical document structure and query patterns
  • 5 shards per index, 500K total docs, ~10-50 searches/s
  • 4 nodes, medium instances, 1 replica
  • JDK 1.6.u41

I know it's hard to diagnose with just this information, but I was
wondering if anyone has seen something similar and/or if there's some
obvious setting I'm overlooking that I should be checking. Do I simply
not have enough nodes? Is there any other information I can provide that
would help?

Thanks!
~Karen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eea5d1ab-e70a-447e-a5a8-4f2e6de210f4%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

The graphs show that GC is working. Can you post more info about what the
queries look like and what messages appear when you run out of heap?

Jörg



(kakaner) #3

Ah, I wasn't clear-- this is what an extended view looks like. It GCs less
and less effectively each time until it crosses the 75% mark and then races
until it runs out of heap. Then we restart. We ended up implementing
automatic rolling restarts of our cluster once the heap crosses the 80% mark.

https://lh6.googleusercontent.com/-lU169R-1J3A/UqDan9D-rII/AAAAAAAAAKE/ehml5168nak/s1600/Screen+Shot+2013-12-05+at+2.55.56+PM.png

We looked for messages in the logs the first time around but couldn't find
any. We haven't quite let it crash since then...

We have 25 time-based indexes aliased to one name. 95% of our searches are
match all queries across all the indexes using the alias, sometimes with
subtypes set. We use term filters heavily-- many times with 50-500 terms
specified, nested inside boolean filters with some other criteria.
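
For readers following along, a request of the shape described above could be sketched roughly as below. This is only an illustration built as a plain Python dict: the field names (`user_id`, `status`) and the helper `build_search_body` are hypothetical, and the real queries in this cluster are issued through the Java API rather than as raw JSON.

```python
import json


def build_search_body(terms_by_field, extra_filters=()):
    """Build a match_all search restricted by terms filters inside a bool filter.

    terms_by_field maps a field name to the values to filter on;
    extra_filters is any further filter clauses to AND in.
    """
    must = [{"terms": {field: list(values)}}
            for field, values in terms_by_field.items()]
    must.extend(extra_filters)
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"bool": {"must": must}},
            }
        }
    }


# Hypothetical example: 200 ids plus one extra criterion.
body = build_search_body(
    {"user_id": range(1, 201)},
    extra_filters=[{"term": {"status": "active"}}],
)
request_json = json.dumps(body)  # what would be sent as the search body
```

Each distinct list of 50-500 terms produces a distinct filter, which matters for the caching question raised later in the thread.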

Does this help?



(Jason Wee) #4

Hi, you said term filters? Did you set the cache to true? If so, check the
cache usage in the cluster and the cache expiry time. /Jason
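
One way to act on this is to pull the node stats and compare filter cache size and evictions per node. The sketch below only post-processes a stats response that has already been fetched and parsed; it assumes each node entry exposes an `indices.filter_cache` section with `memory_size_in_bytes` and `evictions`, which should be verified against the actual output of your version. The helper name and the sample data are made up for illustration.

```python
def filter_cache_by_node(nodes_stats):
    """Return {node_name: (cache_bytes, evictions)} from a node-stats style dict.

    Missing sections are treated as zero so a partial response does not raise.
    """
    usage = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        cache = node.get("indices", {}).get("filter_cache", {})
        usage[node.get("name", node_id)] = (
            cache.get("memory_size_in_bytes", 0),
            cache.get("evictions", 0),
        )
    return usage


# Made-up sample in the assumed shape, not real cluster output.
sample = {
    "nodes": {
        "abc123": {
            "name": "node-a",
            "indices": {
                "filter_cache": {"memory_size_in_bytes": 123, "evictions": 4}
            },
        },
        "def456": {"name": "node-b", "indices": {}},
    }
}
usage = filter_cache_by_node(sample)
```

A filter cache that keeps growing between restarts, or a steadily climbing eviction count, would point toward the filters rather than the GC settings.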



(Jörg Prante) #5

Yes, the term filter is the culprit. It is cached by default.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-term-filter.html

The more term filters are cached, the more your heap grows. You should
disable term filter caching to see if it works better.
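
As a rough sketch of that suggestion: the term and terms filters accept a `_cache` flag in the request body, so caching can be switched off per filter. The helper below is hypothetical and deliberately naive; it walks a filter tree expressed as a Python dict and sets `_cache` to false on every `term`/`terms` clause it finds. The field names are invented. With the Java API, the corresponding builder call should be `cache(false)` on the filter builder, but check that against the version you run.

```python
import copy


def disable_terms_caching(filter_node):
    """Return a copy of a filter tree with `_cache: false` on term/terms filters."""
    node = copy.deepcopy(filter_node)  # leave the caller's dict untouched
    _walk(node)
    return node


def _walk(node):
    # Naive traversal: any dict stored under a "term" or "terms" key is
    # treated as a filter body and gets the flag; everything else is recursed.
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("term", "terms") and isinstance(value, dict):
                value["_cache"] = False
            else:
                _walk(value)
    elif isinstance(node, list):
        for item in node:
            _walk(item)


# Hypothetical filter with invented field names.
cached = {
    "bool": {
        "must": [
            {"terms": {"user_id": [1, 2, 3]}},
            {"term": {"status": "active"}},
        ]
    }
}
uncached = disable_terms_caching(cached)
```

Filters that really are reused across many requests may still be worth caching, so it can make sense to turn caching off only for the large, rarely repeated term lists.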

Jörg


