I have a three-node ELK cluster with 500M documents in 900 indices. The cluster has been running without any memory-related issues for over a year.
I recently started using heartbeat and the Uptime app.
Version is 6.8.0.
I suddenly started getting OutOfMemoryError exceptions that crashed one of the Elasticsearch nodes. After some investigation, it is clear that this happens when I click a link in the Error list table. See screenshot. I get a crash every time I hit the top row in the table.
I have heartbeat data for only 15 days: 2,880 heartbeats per day in daily indices, with 1 primary and 1 replica shard per index.
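For reference, the heartbeat indices themselves are small. Assuming the default heartbeat-* index naming, something like this shows their footprint:

GET _cat/indices/heartbeat-*?v&h=index,pri,rep,docs.count,store.size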
[2019-06-17T14:30:21,198][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] [gc][884] overhead, spent [3.6s] collecting in the last [3.6s]
[2019-06-17T14:30:25,980][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] [gc][885] overhead, spent [4.7s] collecting in the last [4.7s]
[2019-06-17T14:31:44,943][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ow500logan02] collector [node_stats] timed out when collecting data
[2019-06-17T14:31:45,151][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] [gc][886] overhead, spent [59s] collecting in the last [1.3m]
[2019-06-17T14:31:45,615][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ow500logan02] fatal error in thread [elasticsearch[ow500logan02][search][T#6]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:120) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.BigByteArray.<init>(BigByteArray.java:46) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.BigArrays.newByteArray(BigArrays.java:467) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.BigArrays.newByteArray(BigArrays.java:481) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.<init>(HyperLogLogPlusPlus.java:176) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.metrics.cardinality.InternalCardinality.doReduce(InternalCardinality.java:90) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregation.reduce(InternalAggregation.java:135) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:128) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:96) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.bucket.histogram.InternalAutoDateHistogram$Bucket.reduce(InternalAutoDateHistogram.java:131) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.bucket.histogram.InternalAutoDateHistogram.reduceBuckets(InternalAutoDateHistogram.java:338) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.bucket.histogram.InternalAutoDateHistogram.doReduce(InternalAutoDateHistogram.java:500) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregation.reduce(InternalAggregation.java:135) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:128) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.SearchPhaseController.reducedQueryPhase(SearchPhaseController.java:497) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.SearchPhaseController.reducedQueryPhase(SearchPhaseController.java:412) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.SearchPhaseController$1.reduce(SearchPhaseController.java:699) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:101) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:44) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.0.jar:6.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
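Judging from the stack trace, the node dies while reducing a cardinality aggregation nested inside an auto_date_histogram. A request along these lines should exercise the same code path against the heartbeat indices; the field names (@timestamp, monitor.id) and the bucket count are my guesses at what the Uptime app asks for, not taken from its actual query:

GET heartbeat-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-15d" } }
  },
  "aggs": {
    "timeline": {
      "auto_date_histogram": { "field": "@timestamp", "buckets": 25 },
      "aggs": {
        "monitors": {
          "cardinality": { "field": "monitor.id" }
        }
      }
    }
  }
}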
Thanks for the advice.
It is possible that increasing heap size would help, but again, this cluster has been running for months without any memory-related issues. I just find it strange that queries against the fairly small heartbeat indices would cause a general heap shortage.
I suspect that this is more of a memory leak that consumes whatever heap is available in no time.
I might try raising the heap size and see if I am right.
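If I do, it is just a matter of editing config/jvm.options on each node and restarting; the 8g below is only an example value, not what I currently run:

-Xms8g
-Xmx8g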
The screenshot below pretty much explains what I mean. At 09:47 I opened the Uptime app in Kibana. CPU usage and heap allocation go straight up, and then the node crashes.
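For anyone who wants to watch the same climb without the Monitoring UI, per-node heap usage can be polled with something like:

GET _cat/nodes?v&h=name,heap.percent,heap.current,heap.max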