Hey,
we upgraded our cluster (I described the details in another post) and now we see OutOfMemory errors much more frequently than before. ES serves as a logging cluster and is only used by a few developers.
What usually happens is that when certain heavy queries are performed (e.g. a wildcard query or a complex Kibana dashboard), memory usage increases rapidly on all data nodes (and only on the data nodes), and sometimes the circuit breaker does not kick in on every node, causing that node to crash.
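To give an idea of what I mean by a heavy query, it is typically something along these lines (index and field names are made up here, just to show the shape of the request):

POST logstash-*/_search
{
  "query": {
    "wildcard": {
      "message": "*exception*"
    }
  }
}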
I took a heap dump of one node that went OOM and can see that the memory is used by several search threads like elasticsearch[ip-10-0-0-238][search][T#3], as well as by org.elasticsearch.common.util.BigByteArray instances.
Right before the OOM message, I see several GC overhead warnings in the log file.
[2017-06-30T12:54:32,399][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][206] overhead, spent [556ms] collecting in the last [1s]
[2017-06-30T12:54:33,399][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][207] overhead, spent [647ms] collecting in the last [1s]
[2017-06-30T12:54:37,967][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][208] overhead, spent [4.2s] collecting in the last [4.5s]
[2017-06-30T12:54:39,027][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][209] overhead, spent [705ms] collecting in the last [1s]
[2017-06-30T12:54:44,377][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][210] overhead, spent [5.1s] collecting in the last [5.3s]
[2017-06-30T12:54:53,077][WARN ][o.e.m.j.JvmGcMonitorService] [ip-10-0-0-238] [gc][211] overhead, spent [3.8s] collecting in the last [3.9s]
Finally, the node goes OOM:
[2017-06-30T12:57:16,933][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-10-0-0-238] fatal error in thread [elasticsearch[ip-10-0-0-238][search][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:99) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:96) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:53) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.recycler.AbstractRecycler.obtain(AbstractRecycler.java:33) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:28) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.recycler.Recyclers$3.obtain(Recyclers.java:119) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.util.PageCacheRecycler.bytePage(PageCacheRecycler.java:147) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:112) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.util.BigByteArray.<init>(BigByteArray.java:44) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.util.BigArrays.newByteArray(BigArrays.java:464) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.util.BigArrays.resize(BigArrays.java:488) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.common.util.BigArrays.grow(BigArrays.java:502) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.ensureCapacity(HyperLogLogPlusPlus.java:197) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.collect(HyperLogLogPlusPlus.java:232) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregator$DirectCollector.collect(CardinalityAggregator.java:199) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:80) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$2.collect(GlobalOrdinalsStringTermsAggregator.java:127) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.apache.lucene.search.MultiCollector$MultiLeafCollector.collect(MultiCollector.java:174) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
    at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
    at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
    at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:669) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473) ~[lucene-core-6.5.0.jar:6.5.0 4b16c9a10c3c00cafaf1fc92ec3276a7bc7b8c95 - jimczi - 2017-03-21 20:40:22]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$16(IndicesService.java:1107) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.indices.IndicesService$$Lambda$2187/458402294.accept(Unknown Source) ~[?:?]
    at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$18(IndicesService.java:1188) ~[elasticsearch-5.4.0.jar:5.4.0]
    at org.elasticsearch.indices.IndicesService$$Lambda$2188/762572527.get(Unknown Source) ~[?:?]
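Judging by the stack trace, the allocation that blows up the heap happens in a cardinality aggregation nested inside a terms aggregation (HyperLogLogPlusPlus / CardinalityAggregator under GlobalOrdinalsStringTermsAggregator), which as far as I know is what Kibana generates for a "Unique Count" metric split by a term. A rough sketch of such a request, again with made-up index and field names:

POST logstash-*/_search
{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host.keyword", "size": 500 },
      "aggs": {
        "unique_requests": {
          "cardinality": { "field": "request_id.keyword" }
        }
      }
    }
  }
}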