Query from Uptime app causes OutOfMemory exception in Elasticsearch

I have a three-node ELK cluster with 500M documents in 900 indices. The cluster has been running without any memory-related issues for over a year.
I recently started using Heartbeat and the Uptime app.
The version is 6.8.0.
I suddenly started getting OutOfMemoryError exceptions that crashed one of the Elasticsearch nodes. After some investigation, it is clear that this happens when you click a link in the Error list table (see screenshot). I get a crash every time I click the top row in the table.

[Screenshot: uptime_issue_2]

I have only 15 days of Heartbeat data: 2,880 heartbeats per day in daily indices, with one primary shard and one replica per index.
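For scale, the Heartbeat figures above work out roughly as follows. This is a back-of-envelope sketch that assumes each daily index holds exactly one day's 2,880 heartbeats and has a single primary shard; it ignores any other documents in those indices.

```python
# Back-of-envelope size of the Heartbeat data described above.
DAYS = 15                # days of Heartbeat data retained (one daily index per day)
DOCS_PER_DAY = 2880      # heartbeats per day
COPIES_PER_INDEX = 2     # one primary shard plus one replica (assumed single primary)
CLUSTER_DOCS = 500_000_000  # approximate document count for the whole cluster

total_docs = DAYS * DOCS_PER_DAY              # Heartbeat documents (primaries only)
shard_copies = DAYS * COPIES_PER_INDEX        # shard copies across all daily indices
share_of_cluster = total_docs / CLUSTER_DOCS  # fraction of all documents in the cluster

print(f"{total_docs} Heartbeat docs in {shard_copies} shard copies "
      f"({share_of_cluster:.4%} of the cluster's documents)")
```

That is about 43,200 documents, a tiny fraction of the cluster, which is why a heap exhaustion triggered by these indices looks surprising.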

Here is the stack trace, which does not say much:

[2019-06-17T14:30:21,198][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] [gc][884] overhead, spent [3.6s] collecting in the last [3.6s]
[2019-06-17T14:30:25,980][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] [gc][885] overhead, spent [4.7s] collecting in the last [4.7s]
[2019-06-17T14:31:44,943][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ow500logan02] collector [node_stats] timed out when collecting data
[2019-06-17T14:31:45,151][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] [gc][886] overhead, spent [59s] collecting in the last [1.3m]
[2019-06-17T14:31:45,615][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ow500logan02] fatal error in thread [elasticsearch[ow500logan02][search][T#6]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:120) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.BigByteArray.<init>(BigByteArray.java:46) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.BigArrays.newByteArray(BigArrays.java:467) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.BigArrays.newByteArray(BigArrays.java:481) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.<init>(HyperLogLogPlusPlus.java:176) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.metrics.cardinality.InternalCardinality.doReduce(InternalCardinality.java:90) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregation.reduce(InternalAggregation.java:135) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:128) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:96) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.bucket.histogram.InternalAutoDateHistogram$Bucket.reduce(InternalAutoDateHistogram.java:131) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.bucket.histogram.InternalAutoDateHistogram.reduceBuckets(InternalAutoDateHistogram.java:338) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.bucket.histogram.InternalAutoDateHistogram.doReduce(InternalAutoDateHistogram.java:500) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregation.reduce(InternalAggregation.java:135) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:128) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.SearchPhaseController.reducedQueryPhase(SearchPhaseController.java:497) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.SearchPhaseController.reducedQueryPhase(SearchPhaseController.java:412) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.SearchPhaseController$1.reduce(SearchPhaseController.java:699) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:101) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:44) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.0.jar:6.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
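The GC warnings at the top of the log above show the node spending almost all of its time in garbage collection just before the crash. A small sketch for pulling that ratio out of such lines; it only assumes the `[gc][N] overhead, spent [X] collecting in the last [Y]` message format shown in this log, and `gc_overhead` is a hypothetical helper name.

```python
import re

# Matches JvmGcMonitorService overhead messages such as:
# ... [gc][886] overhead, spent [59s] collecting in the last [1.3m]
GC_LINE = re.compile(
    r"\[gc\]\[(\d+)\] overhead, spent \[([\d.]+)(ms|s|m|h)\] "
    r"collecting in the last \[([\d.]+)(ms|s|m|h)\]"
)
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}

def gc_overhead(line):
    """Return (gc_id, fraction of the interval spent in GC), or None if no match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    gc_id, spent, spent_unit, window, window_unit = m.groups()
    spent_s = float(spent) * UNIT_SECONDS[spent_unit]
    window_s = float(window) * UNIT_SECONDS[window_unit]
    return int(gc_id), spent_s / window_s

log = [
    "[2019-06-17T14:30:21,198][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] "
    "[gc][884] overhead, spent [3.6s] collecting in the last [3.6s]",
    "[2019-06-17T14:30:25,980][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] "
    "[gc][885] overhead, spent [4.7s] collecting in the last [4.7s]",
    "[2019-06-17T14:31:45,151][WARN ][o.e.m.j.JvmGcMonitorService] [ow500logan02] "
    "[gc][886] overhead, spent [59s] collecting in the last [1.3m]",
]

for line in log:
    result = gc_overhead(line)
    if result is not None:
        gc_id, ratio = result
        print(f"gc[{gc_id}]: {ratio:.0%} of the interval spent in GC")
```

The first two intervals are 100% GC, and the last one is 59 s of GC in a 78 s window (about 76%), consistent with a heap that was already full before the OutOfMemoryError.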

Start-up log, showing the Java version and JVM arguments:

58,099][INFO ][o.e.n.Node ] [ow500logan02] version[6.8.0], pid[82954], build[default/rpm/65b6179/2019-05-15T20:06:13.172855Z], OS[Linux/3.10.0-957.el7.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_191/25.191-b12]
[2019-06-17T14:14:58,099][INFO ][o.e.n.Node ] [ow500logan02] JVM arguments [-Xms8g, -Xmx8g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-2270712997360736217, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/opt/logan/elk/elasticsearch/logs, -XX:ErrorFile=/opt/logan/elk/elasticsearch/logs/hs_err_pid%p.log, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=rpm]
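The JVM arguments show an 8 GB heap (-Xms8g, -Xmx8g) with CMS set to start a concurrent collection at 75% occupancy; with -XX:+UseCMSInitiatingOccupancyOnly the JVM uses only that fixed threshold. A small sketch that extracts those two numbers from the logged argument list; `parse_size` and `heap_settings` are hypothetical helper names, and only the -Xmx and -XX:CMSInitiatingOccupancyFraction flags are read.

```python
# JVM arguments from the start-up log above, abbreviated to the heap/GC flags.
JVM_ARGS = ("-Xms8g, -Xmx8g, -XX:+UseConcMarkSweepGC, "
            "-XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly")

SIZE_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(value):
    """Convert a JVM size such as '8g' or '512m' into bytes."""
    unit = value[-1].lower()
    if unit in SIZE_UNITS:
        return int(value[:-1]) * SIZE_UNITS[unit]
    return int(value)  # a plain number is already in bytes

def heap_settings(args):
    """Return (max heap bytes, heap occupancy in bytes at which CMS starts)."""
    flags = [a.strip() for a in args.split(",")]
    max_heap = next(parse_size(f[len("-Xmx"):]) for f in flags if f.startswith("-Xmx"))
    fraction = next(int(f.split("=")[1]) for f in flags
                    if f.startswith("-XX:CMSInitiatingOccupancyFraction="))
    return max_heap, max_heap * fraction // 100

max_heap, cms_trigger = heap_settings(JVM_ARGS)
print(f"max heap {max_heap / 2**30:.1f} GiB, CMS starts at {cms_trigger / 2**30:.1f} GiB")
```

For these arguments that is an 8 GiB heap with concurrent collection starting at 6 GiB of occupancy.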

Hi @Mattias_Arbin,

The logs you provided show a Java heap space OutOfMemoryError. You should check heap usage and consider increasing the heap size.

Here is a related article:
https://www.elastic.co/blog/a-heap-of-trouble

Also, please read the links below, as you have a large cluster with 900 indices.

Regards,
Harsh Bajaj
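To follow the advice to check heap usage, the nodes stats API (`GET /_nodes/stats/jvm`) reports per-node heap figures; the heap size itself is set with -Xms/-Xmx in jvm.options, which for this RPM install lives under /etc/elasticsearch. A minimal sketch, assuming the cluster is reachable at http://localhost:9200 without authentication (adjust URL, credentials and TLS as needed); `fetch_jvm_stats` and `heap_report` are hypothetical helper names.

```python
import json
import urllib.request

def fetch_jvm_stats(base_url="http://localhost:9200"):
    """GET /_nodes/stats/jvm. The base URL is an assumption; add auth/TLS as needed."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/jvm") as resp:
        return json.load(resp)

def heap_report(stats):
    """Return (node name, heap used %, max heap in GiB) per node, fullest heap first."""
    rows = []
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        rows.append((node["name"], mem["heap_used_percent"],
                     mem["heap_max_in_bytes"] / 2**30))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Usage: `for name, pct, max_gib in heap_report(fetch_jvm_stats()): print(name, pct, max_gib)`. Running it while opening the Uptime app would show whether heap climbs steadily or jumps in one go.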

Thanks for the advice.
It is possible that increasing heap size would help, but again, this cluster has been running for months without any memory-related issues. I just find it strange that queries against the fairly small heartbeat indices would cause a general heap shortage.
I suspect that this is more of a memory leak that consumes any available heap memory in no time.
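The stack trace shows the heap running out inside `HyperLogLogPlusPlus.<init>`, i.e. while a cardinality aggregation nested in an `auto_date_histogram` was being reduced. Whether the cause is a leak or simply a very large reduce, a rough estimate of what one cardinality sketch costs helps put the numbers in perspective. This sketch assumes Elasticsearch allocates 2^p bytes of registers per HLL++ sketch and derives p from `precision_threshold` as shown; both are our reading of the implementation, not something stated in this thread, and the Uptime app's actual query settings are unknown. The helper names are hypothetical.

```python
import math

# Constants as we understand Elasticsearch's HyperLogLogPlusPlus class
# (assumption: not confirmed anywhere in this thread).
MIN_PRECISION = 4
MAX_PRECISION = 18
MAX_LOAD_FACTOR = 0.75

def precision_from_threshold(precision_threshold):
    """Map a cardinality precision_threshold to an HLL++ precision p (assumed formula)."""
    hash_table_entries = math.ceil(precision_threshold / MAX_LOAD_FACTOR)
    p = (hash_table_entries * 4).bit_length()  # bits needed for entries * 4 bytes
    return min(max(p, MIN_PRECISION), MAX_PRECISION)

def sketch_bytes(p):
    """A sketch with one byte per register uses 2**p bytes."""
    return 1 << p

def sketches_to_fill(heap_bytes, p):
    """How many such sketches it would take to occupy heap_bytes of heap."""
    return heap_bytes // sketch_bytes(p)

p = precision_from_threshold(3000)  # 3000 is the documented default precision_threshold
print(p, sketch_bytes(p), sketches_to_fill(8 * 2**30, p))
```

Under these assumptions the default threshold gives p = 14, i.e. 16 KiB per sketch, and about 524,288 sketches would fill an 8 GiB heap. So heap use during the reduce may depend more on how many buckets and sketches get created than on how many documents the Heartbeat indices hold.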

I might try raising the heap size to see if I am right. ;)

The screenshot below pretty much explains what I mean. At 09:47 I opened the Uptime app in Kibana. CPU and heap usage go straight up, and then the node crashes.
