Frequent OutOfMemory crashes

Hey,

We upgraded our cluster (I described the details in another post) and now experience OutOfMemory errors far more frequently than before. ES serves as a logging cluster and is only used by a few developers.

What usually happens: when certain heavy queries are run (e.g. wildcard queries or complex Kibana dashboards), memory usage increases rapidly on the data nodes (and only on those), and sometimes the circuit breaker does not kick in on every node, causing that node to crash.
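
In case it is relevant: per-node breaker limits and estimated usage can be queried with the nodes stats API; a quick sketch (the hostname is just a placeholder, not output from our cluster):

# per-breaker limit vs. estimated size on every node
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'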

I took a heap dump of one OOM'd node and can see that the memory is used by several of these search threads (elasticsearch[ip-10-0-0-238][search][T#3]) as well as by org.elasticsearch.common.util.BigByteArray instances.

Right before the OOM message, I see several GC overhead warnings in the log file:

[2017-06-30T12:54:32,399][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][206] overhead, spent [556ms] collecting in the last [1s]
[2017-06-30T12:54:33,399][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][207] overhead, spent [647ms] collecting in the last [1s]
[2017-06-30T12:54:37,967][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][208] overhead, spent [4.2s] collecting in the last [4.5s]
[2017-06-30T12:54:39,027][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][209] overhead, spent [705ms] collecting in the last [1s]
[2017-06-30T12:54:44,377][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][210] overhead, spent [5.1s] collecting in the last [5.3s]
[2017-06-30T12:54:53,077][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][211] overhead, spent [3.8s] collecting in the last [3.9s]

Finally, the node goes OOM:

[2017-06-30T12:57:16,933][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-10-0-0-238] fatal error in thread [elasticsearch[ip-10-0-0-238][search][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
	at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:99) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:96) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:53) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.recycler.AbstractRecycler.obtain(AbstractRecycler.java:33) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:28) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.recycler.Recyclers$3.obtain(Recyclers.java:119) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.PageCacheRecycler.bytePage(PageCacheRecycler.java:147) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:112) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.BigByteArray.<init>(BigByteArray.java:44) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.BigArrays.newByteArray(BigArrays.java:464) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.BigArrays.resize(BigArrays.java:488) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.BigArrays.grow(BigArrays.java:502) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.ensureCapacity(HyperLogLogPlusPlus.java:197) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collect(HyperLogLogPlusPlus.java:232) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator$DirectCollector.collect(CardinalityAggregator.java:199) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:80) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$2.collect(GlobalOrdinalsStringTermsAggregator.java:127) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.apache.lucene.search.MultiCollector$MultiLeafCollector.collect(MultiCollector.java:174) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:669) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$16(IndicesService.java:1107) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.indices.IndicesService$$Lambda$2187/458402294.accept(Unknown Source) ~[?:?]
	at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$18(IndicesService.java:1188) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.indices.IndicesService$$Lambda$2188/762572527.get(Unknown Source) ~[?:?]

Hi @BastiPaeltz,

I see a cardinality aggregation in your stack trace, so I wonder whether you're being hit by #24359. You could try two things:

  • As suggested in the ticket, set "global_ordinals_hash" as the execution_hint (in Kibana: Edit Viz -> aggs bucket -> advanced -> JSON input -> {"execution_hint": "global_ordinals_hash"}); see the sketch after this list.
  • Experiment with the cardinality aggregation's precision control parameter (precision_threshold).
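
For reference, here is a minimal sketch of what the raw request could look like; the index and field names are placeholders, not taken from your cluster. The execution_hint goes on the terms aggregation and precision_threshold on the nested cardinality aggregation:

GET logs-*/_search
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host.keyword",
        "execution_hint": "global_ordinals_hash"
      },
      "aggs": {
        "unique_requests": {
          "cardinality": {
            "field": "request_id.keyword",
            "precision_threshold": 1000
          }
        }
      }
    }
  }
}

Lowering precision_threshold reduces the memory each per-bucket HyperLogLog++ sketch may use, at the cost of some accuracy on high-cardinality fields.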

Daniel
