90.5/7 OOM errors-- memory leak or GC problems?

kakaner · December 5, 2013, 7:26pm

Hi all!

I set up an ES cluster a couple weeks ago dedicated to a specific search
and document pattern and have been experiencing problems with it since then.

Every 18-24 hours we need to restart our cluster because we run out of
heap. Either there's a memory leak or problems with GC. Here is an image of
the sample memory usage:

https://lh6.googleusercontent.com/-CniK9Tc1J5I/UqDRPlRf5HI/AAAAAAAAAJk/lFK5nYdMo9Q/s1600/Screen+Shot+2013-11-30+at+2.47.51+AM.png
Note: drops to 0 are cluster restarts

We deployed with JDK 1.7.u25 and v0.90.5. Relevant stats:

4 nodes (AWS 2xlarge), 1 replica
16G reserved heap
15 shards per index, 25 indexes, only 11M docs, relatively uniformly
distributed over indexes (I know the allocation is overkill right now, but
we're preparing for a huge influx of data)
200-500 searches/s
mlockall = true
Using the Java API in Scala
wrapper.java.additional.1=-Delasticsearch-service
wrapper.java.additional.2=-Des.path.home=%ES_HOME%
wrapper.java.additional.3=-Xss256k
wrapper.java.additional.4=-XX:+UseParNewGC
wrapper.java.additional.5=-XX:+UseConcMarkSweepGC
wrapper.java.additional.6=-XX:CMSInitiatingOccupancyFraction=75
wrapper.java.additional.7=-XX:+UseCMSInitiatingOccupancyOnly
wrapper.java.additional.8=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.9=-Djava.awt.headless=true

Things we then tried:

Per this post http://jontai.me/blog/2013/06/esrejectedexecutionexception-rejected-execution-of-messagechannelhandler-requesthandler/ I
updated to a fixed thread pool with unbounded queues. However, I understand
this wasn't necessary for 0.90.5? Nothing changed.
Changed heap to 8G. Got worse.
Downgraded JDK to 1.6u41 since it was working on another box. Nothing
changed.
Finally upgraded to 0.90.7 and JDK 1.7.u45 per this use case https://groups.google.com/forum/#!searchin/elasticsearch/jvm$20heap/elasticsearch/tAZIC_ffAiU/n3wPpMu6FzgJ.
Slightly better; now the graphs look like this (we can last 2-3 days
without a restart):

https://lh4.googleusercontent.com/-1qv5RGtJZwU/UqDRWXee0xI/AAAAAAAAAJs/uIz22Fk_tKM/s1600/Screen+Shot+2013-12-05+at+12.09.32+PM.png

We have a QA setup that is not experiencing problems:

Identical document structure and query patterns
5 shards per index, 500K total docs, ~10-50 searches/s
4 nodes, medium instances, 1 replica
JDK 1.6.u41

I know it's hard to diagnose with just this information, but I was
wondering if anyone has seen something similar and/or if there's some
obvious setting I'm overlooking that I should be checking. Do I simply
not have enough nodes? Is there any other information I can provide that
would help?

Thanks!
~Karen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eea5d1ab-e70a-447e-a5a8-4f2e6de210f4%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · December 5, 2013, 7:44pm

The graphs show that GC is working. Can you post more info about what the
queries look like and what messages appear when you run out of heap?

Jörg

kakaner · December 5, 2013, 8:08pm

Ah, I wasn't clear. This is what an extended view looks like. It GCs less
and less effectively each time until it crosses the 75% mark and then races
until it runs out of heap. Then we restart. We ended up implementing
automatic rolling restarts of our cluster once the heap crosses the 80% mark.

https://lh6.googleusercontent.com/-lU169R-1J3A/UqDan9D-rII/AAAAAAAAAKE/ehml5168nak/s1600/Screen+Shot+2013-12-05+at+2.55.56+PM.png

We looked for messages in the logs the first time around but couldn't find
any. We haven't let it quite crash since then...
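
The restart trigger described above (heap crossing the 80% mark) could be sketched roughly as follows. This is only an illustration, not the script actually used here: it assumes the node stats response has already been fetched and parsed into a dict, that each node reports `jvm.mem.heap_used_in_bytes` and `jvm.mem.heap_committed_in_bytes`, and that committed heap equals the configured heap (min and max heap set to the same value). The function name and the sample numbers are made up.

```python
def nodes_over_heap_threshold(node_stats, threshold=0.80):
    """Return the names of nodes whose heap usage exceeds `threshold`.

    `node_stats` is assumed to be the parsed JSON of a node stats
    response, shaped like {"nodes": {node_id: {"name": ..., "jvm": ...}}}.
    """
    over = []
    for node_id, node in node_stats.get("nodes", {}).items():
        mem = node.get("jvm", {}).get("mem", {})
        used = mem.get("heap_used_in_bytes")
        committed = mem.get("heap_committed_in_bytes")
        if not used or not committed:
            continue  # stats missing for this node; skip rather than guess
        if used / committed > threshold:
            over.append(node.get("name", node_id))
    return sorted(over)


# Hypothetical stats for two nodes with a 16G heap each.
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"mem": {
            "heap_used_in_bytes": 14000000000,
            "heap_committed_in_bytes": 17179869184}}},
        "b2": {"name": "node-2", "jvm": {"mem": {
            "heap_used_in_bytes": 9000000000,
            "heap_committed_in_bytes": 17179869184}}},
    }
}

print(nodes_over_heap_threshold(sample))
```

A check like this only tells you when to restart; it does not explain why the heap floor keeps rising between collections.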

We have 25 time-based indexes aliased to one name. 95% of our searches are
match_all queries across all the indexes using the alias, sometimes with
subtypes set. We use term filters heavily, often with 50-500 terms
specified, nested inside boolean filters with some other criteria.

Does this help?

Jason_Wee · December 6, 2013, 7:08am

Hi, you said term filters? Did you set the cache to true? If so, check the
cache usage in the cluster and the cache expiry time. /Jason
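
One way to check cache usage per node is to read the filter cache figures out of the node stats response. The sketch below assumes the response is already parsed into a dict and that each node exposes `indices.filter_cache.memory_size_in_bytes` and `indices.filter_cache.evictions`; field names differ between Elasticsearch versions, so verify them against your own output. The function name and sample values are hypothetical.

```python
def filter_cache_report(node_stats):
    """Summarize filter cache size (MB) and evictions per node.

    Assumes the 'indices.filter_cache' section is present for each node;
    missing values are reported as 0.
    """
    report = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        fc = node.get("indices", {}).get("filter_cache", {})
        size_bytes = fc.get("memory_size_in_bytes", 0)
        report[node.get("name", node_id)] = {
            "size_mb": round(size_bytes / (1024 * 1024), 1),
            "evictions": fc.get("evictions", 0),
        }
    return report


# Hypothetical stats: node-1 holds 3 GB of cached filters, node-2 holds 512 MB.
sample = {
    "nodes": {
        "a1": {"name": "node-1", "indices": {"filter_cache": {
            "memory_size_in_bytes": 3221225472, "evictions": 0}}},
        "b2": {"name": "node-2", "indices": {"filter_cache": {
            "memory_size_in_bytes": 536870912, "evictions": 42}}},
    }
}

for name, stats in sorted(filter_cache_report(sample).items()):
    print(name, stats)
```

A cache that keeps growing with zero evictions between restarts would be consistent with cached filters filling the heap.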

jprante · December 6, 2013, 11:40am

Yes, the term filter is the culprit. It is cached by default.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-term-filter.html

The more term filters are cached, the more your heap grows. You should
disable term filter caching to see if it works better.

Jörg
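
For illustration, disabling caching on a filter is done with the `_cache` flag in the query DSL. The sketch below builds a request body of the shape described earlier in the thread (a match_all query with a multi-term filter nested in a bool filter). It assumes the multi-value `terms` filter is what is being used; the helper name, field names, and values are hypothetical, and the body would still need to be sent through whatever client you use.

```python
import json


def uncached_terms_search(field, values, other_filters=()):
    """Build a filtered match_all search body whose terms filter is not cached."""
    terms_filter = {
        "terms": {
            field: list(values),
            "_cache": False,  # opt out of the default filter caching
        }
    }
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "bool": {"must": [terms_filter] + list(other_filters)}
                },
            }
        }
    }


# Hypothetical field names and values.
body = uncached_terms_search(
    "user_id",
    ["u1", "u2", "u3"],
    other_filters=[{"term": {"status": "active"}}],
)
print(json.dumps(body, indent=2))
```

Filters with hundreds of mostly unique terms are unlikely to be reused, so caching them costs heap without saving much work; filters that do repeat often may still be worth caching.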
