Elastic node crashes after Kibana query with java.lang.OutOfMemoryError: Java heap space

The query in Kibana is this simple:
Metrics: unique count(device_id.keyword)
Buckets: subscription_code.keyword
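For reference, that visualization translates roughly into the following aggregation request (a sketch only; the index name `my-index` and the `size` values are placeholders, not taken from the actual Kibana request):

```shell
# Roughly what Kibana runs for the visualization above: a terms bucket
# on subscription_code.keyword with a cardinality sub-aggregation on
# device_id.keyword. "my-index" is a placeholder index name.
curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/my-index/_search' -d '
{
  "size": 0,
  "aggs": {
    "by_subscription": {
      "terms": { "field": "subscription_code.keyword", "size": 10 },
      "aggs": {
        "unique_devices": {
          "cardinality": { "field": "device_id.keyword" }
        }
      }
    }
  }
}'
```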

My cluster has 8 nodes, each with 8 cores / 64 GB RAM and a 31 GB heap.
Elasticsearch version 6.4.2

[2019-11-08T15:19:22,118][WARN ][o.e.m.j.JvmGcMonitorService] [elastic_node5] [gc][old][10741][6] duration [29.5s], collections [2]/[29.6s], total [29.5s]/[29.9s], memory [30.3gb]->[30.8gb]/[30.9gb], all_pools {[young] [398.1mb]->[532.5mb]/[532.5mb]}{[survivor] [66.5mb]->[25.4mb]/[66.5mb]}{[old] [29.8gb]->[30.3gb]/[30.3gb]}
[2019-11-08T15:19:22,123][WARN ][o.e.m.j.JvmGcMonitorService] [elastic_node5] [gc][10741] overhead, spent [29.5s] collecting in the last [29.6s]
[2019-11-08T15:20:21,437][WARN ][o.e.m.j.JvmGcMonitorService] [elastic_node5] [gc][old][10742][11] duration [1m], collections [5]/[1m], total [1m]/[1.5m], memory [30.8gb]->[30.9gb]/[30.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [25.4mb]->[65.5mb]/[66.5mb]}{[old] [30.3gb]->[30.3gb]/[30.3gb]}
[2019-11-08T15:20:21,437][WARN ][o.e.m.j.JvmGcMonitorService] [elastic_node5] [gc][10742] overhead, spent [1m] collecting in the last [1m]
[2019-11-08T15:28:20,244][ERROR][o.e.x.m.c.n.NodeStatsCollector] [elastic_node5] collector [node_stats] timed out when collecting data
[2019-11-08T15:28:21,143][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [elastic_node5] send message failed [channel: NettyTcpChannel{localAddress=/1.1.1.1:9300, remoteAddress=/12.2.2.2:58926}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-11-08T15:28:20,404][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [elastic_node5] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/3.3.3.3:38318}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-11-08T15:28:21,066][ERROR][o.e.i.e.Engine ] [elastic_node5] [index_norm_ekt_00847][3] already closed by tragic event on the index writer
java.lang.OutOfMemoryError: Java heap space

Do you have any idea or experience with why this happened?
A single user query crashes the server, so why does Elasticsearch not protect against such queries?

What does your node configuration look like? Do you have any non-default settings?


@Christian_Dahlqvist good question.
This is my config (same on all nodes):

node01:/etc/elasticsearch# cat elasticsearch.yml |grep -v ^#
cluster.name: corp-cz-cem
node.name: node01-prahkz
path.data: /data/elasticsearch
path.logs: /data/elasticsearch/log
bootstrap.memory_lock: true
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["node13-prahkzcorp", "node14-prahkzcorp", "node15-prahkzcorp", "node16-prahkzcorp"]

node01-prahkz:/etc/elasticsearch# cat jvm.options |grep -v ^#

-Xms31g
-Xmx31g

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

-XX:+AlwaysPreTouch

-Xss1m

-Djava.awt.headless=true

-Dfile.encoding=UTF-8

-Djna.nosys=true

-XX:-OmitStackTraceInFastThrow

-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

-XX:+HeapDumpOnOutOfMemoryError

-XX:HeapDumpPath=/data/elasticsearch/log

-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
9-:-Djava.locale.providers=COMPAT

10-:-XX:UseAVX=2

What is the settings for those metrics and buckets in Kibana? What is the cardinality of the device_id and subscription_code fields? Are you using default dynamic mappings?


Unique count of device_id: 280k
Count: 28 mil
Unique count of subscription_code.keyword: 240k

These are the numbers across a whole day; I need to break them down into time windows, with buckets 3-4 hours long. However, at peak a single window could contain a similar volume.
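The time-window breakdown could be sketched as a `date_histogram` with a `cardinality` sub-aggregation (using the 6.x `interval` syntax; the index name `my-index` and timestamp field `@timestamp` are assumptions, and `precision_threshold` trades accuracy for a bounded per-bucket memory footprint):

```shell
# Sketch of the 3-4h window breakdown: a date_histogram (6.x "interval"
# syntax) with a cardinality sub-aggregation per window. Index name
# "my-index" and timestamp field "@timestamp" are placeholders.
curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/my-index/_search' -d '
{
  "size": 0,
  "aggs": {
    "per_window": {
      "date_histogram": { "field": "@timestamp", "interval": "3h" },
      "aggs": {
        "unique_devices": {
          "cardinality": {
            "field": "device_id.keyword",
            "precision_threshold": 3000
          }
        }
      }
    }
  }
}'
```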

What other approach would you suggest for calculating this use case?