java.lang.OutOfMemoryError: Java heap space - GC overhead when using visualizations

I'm running Elasticsearch and Kibana on a clean Linode instance with 8 GB of RAM.
I only imported a CSV file through Logstash into a single Elasticsearch index: plain text fields, two date fields, and some numeric ones.

About the index: it has 2 shards, 425,948 docs, and its size is 283.7 MB.
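
(For reference, I believe the same stats can be pulled from the cat indices API; the index pattern here is the one that appears in the search logs further down:)

GET _cat/indices/wh_rates-*?v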

To learn how to use Kibana, we created two visualizations. When we try to use them, the whole thing becomes very, very slow, Kibana shows timeouts, and Elasticsearch does impressive amounts of GC work until a Java heap out-of-memory error arrives. It eats my heap immediately.

Here is an extract from my Elasticsearch logs:

[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:01:16,926][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52287] overhead, spent [11.6s] collecting in the last [11.6s]
[2018-07-05T07:01:28,246][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52288][63] duration [25.7s], collections [3]/[25.8s], total [25.7s]/[9.1m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [30.4mb]->[29.6mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:01:28,248][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52288] overhead, spent [25.7s] collecting in the last [25.8s]
[2018-07-05T07:02:20,940][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52289][69] duration [52.6s], collections [6]/[41s], total [52.6s]/[9.9m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [29.6mb]->[31mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:02:20,940][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52289] overhead, spent [52.6s] collecting in the last [41s]
[2018-07-05T07:02:43,093][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52290][72] duration [22.1s], collections [3]/[33.7s], total [22.1s]/[10.3m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [31mb]->[32.1mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:02:43,093][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52290] overhead, spent [22.1s] collecting in the last [33.7s]

...

[2018-07-05T07:03:02,210][ERROR][o.e.x.m.c.i.IndexStatsCollector] [vzbqXTQ] collector [index-stats] timed out when collecting data
[2018-07-05T07:03:39,028][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [vzbqXTQ] collector [cluster_stats] timed out when collecting data
[2018-07-05T07:03:52,852][ERROR][o.e.x.m.c.n.NodeStatsCollector] [vzbqXTQ] collector [node_stats] timed out when collecting data

...

[2018-07-05T07:20:51,870][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
	at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:182)
	at 

This is my jvm.options:

-Xms4g
-Xmx4g

-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch needs to set the provider to COMPAT, otherwise
# time/date parsing will break in an incompatible way for some date patterns and locales
9-:-Djava.locale.providers=COMPAT
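
(For what it's worth, I believe the heap behaviour those settings produce can be watched at runtime through the node stats API, e.g.:

GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem

heap_used_percent in that output is, as I understand it, the number that climbs towards 100% right before the GC storms in the logs above.)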

Here is some real-time monitoring info:

This happens especially when trying to see a visualization, whose settings are these:

[image: visualization settings screenshot]

And this is its log:

 [vzbqXTQ] collector [cluster_stats] timed out when collecting data
[2018-07-05T08:08:42,220][DEBUG][o.e.a.s.TransportSearchAction] [vzbqXTQ] [wh_rates-07.04.2018][1], node[vzbqXTQNToSz2RCb5IZhTw], [P], s[STARTED], a[id=h6cXtncYQbyI6eg4InHyiA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[wh_rates-*], indicesOptions=IndicesOptions[id=39, ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false], types=[], routing='null', preference='1530777866498', requestCache=null, scroll=null, maxConcurrentShardRequests=5, batchedReduceSize=512, preFilterShardSize=64, allowPartialSearchResults=true, source={"size":0,"query":{"bool":{"must":[{"range":{"Date":{"from":1467705903138,"to":1530777903138,"include_lower":true,"include_upper":true,"format":"epoch_millis","boost":1.0}}}],"filter":[{"match_all":{"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":[],"excludes":[]},"stored_fields":"*","docvalue_fields":["@timestamp","Date","Date From Rates"],"script_fields":{},"aggregations":{"7":{"terms":{"field":"SiteId.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"1":"desc"},{"_key":"asc"}]},"aggregations":{"1":{"sum":{"field":"Installs"}},"5":{"terms":{"field":"SubPublisherIdentifier.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"4":"desc"},{"_key":"asc"}]},"aggregations":{"4":{"sum":{"field":"Cost"}},"6":{"terms":{"field":"Parent Campaign.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"4":"desc"},{"_key":"asc"}]},"aggregations":{"1":{"sum":{"field":"Installs"}},"2":{"sum":{"field":"Registrations"}},"3":{"sum":{"field":"FDs"}},"4":{"sum":{"field":"Cost"}}}}}}}}}}}] lastShard [true]
org.elasticsearch.transport.TransportException: transport stopped, action: indices:data/read/search[phase/query]
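
Pretty-printed, the aggregations section of that request is roughly the structure below (pulled straight from the log line above; the numeric keys are just Kibana's internal ids for each metric and bucket):

{
  "aggregations": {
    "7": {
      "terms": { "field": "SiteId.keyword", "size": 50000 },
      "aggregations": {
        "1": { "sum": { "field": "Installs" } },
        "5": {
          "terms": { "field": "SubPublisherIdentifier.keyword", "size": 50000 },
          "aggregations": {
            "4": { "sum": { "field": "Cost" } },
            "6": {
              "terms": { "field": "Parent Campaign.keyword", "size": 50000 },
              "aggregations": {
                "1": { "sum": { "field": "Installs" } },
                "2": { "sum": { "field": "Registrations" } },
                "3": { "sum": { "field": "FDs" } },
                "4": { "sum": { "field": "Cost" } }
              }
            }
          }
        }
      }
    }
  }
}

So it is three nested terms aggregations, each asking for up to 50000 buckets per level.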

I'm pretty much lost and I'm new to ELK. Can anybody help me out? Am I doing something wrong?
It doesn't seem to me that just one index with ~425k documents should be too much for an 8 GB node.

Thanks a lot

Alejandro

What version are you on?

6.3, the current one.
Btw, I see in a diagnostics tool (Cerebro) that my 2 shards are unassigned. Is that problematic?
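
(In case it helps, I believe the allocation can also be checked directly against the cluster with something like:

GET _cat/shards/wh_rates-*?v

and, from what I've read, GET _cluster/allocation/explain reports the reason for the first unassigned shard it finds.)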

Your cluster is exposed to the internet; I guessed the IP from your image above. You might want to look at that first.

Thanks for that, it's already blocked :blush:

Ok cool.

So my guess is that the visualisation you have created is generating too many buckets, and it's OOMing from that.
What happens if you split the 4 metrics into their own visualisations, with the sub-buckets?

I disabled 3 of the 4 metrics this visualisation has, and it's still eating the whole heap.
IMO this is a fairly normal visualisation; it doesn't look like too much either.

Maybe something is wrong with the buckets? It's very strange.

What sizes have you configured for your aggregations? Can you expand the view?

We had the size set to 50000 in the 4 buckets.
Is that too much?

I actually tried reducing it to 5, but it looks like the Elasticsearch GC still cannot finish cleaning up the objects.

[vzbqXTQ] [gc][old][1026][43] duration [18.4s], collections [2]/[18.4s], total [18.4s]/[5.9m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [27.3mb]->[28.8mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}

It's as if it cannot clean the old generation properly...

That will generate 50000^4 buckets, which is quite a lot. What is the cardinality of those fields?
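
You can get a rough idea with a cardinality aggregation on each of those fields, something along these lines (index pattern and field names taken from your request above):

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "sites": { "cardinality": { "field": "SiteId.keyword" } },
    "sub_publishers": { "cardinality": { "field": "SubPublisherIdentifier.keyword" } },
    "campaigns": { "cardinality": { "field": "Parent Campaign.keyword" } }
  }
}

The counts are approximate, but they tell you how many buckets each terms level can realistically produce.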

What does that mean?
Btw, here are some details on the buckets:

[image: bucket configuration screenshot]

Btw, I just disabled the aggregations to test, and without them it also kills the heap, and it doesn't recover...

Set the size to 10 or 100 instead of 50000 and see if that makes a difference.
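
In the raw request, that is the size parameter on each terms aggregation; the first level alone, trimmed down, would look something like:

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "7": {
      "terms": { "field": "SiteId.keyword", "size": 10 },
      "aggs": { "1": { "sum": { "field": "Installs" } } }
    }
  }
}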

I reduced it to 50 and it also collapses:

ClusterStatsCollector] [vzbqXTQ] collector [cluster_stats] timed out when collecting data

How can I be sure the problem is in the visualisation and not something else?
I mean, it looks to me like the GC is not able to clean old objects. I'm pretty confused here.
I appreciate any help.

Should we use the composite aggregation instead? I don't really understand the difference yet.
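
(From what I've read in the docs so far, a composite aggregation pages through the bucket combinations instead of materialising all of them in one response; over the same fields it would look roughly like this, if I understand it correctly:

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "rows": {
      "composite": {
        "size": 1000,
        "sources": [
          { "site": { "terms": { "field": "SiteId.keyword" } } },
          { "sub_publisher": { "terms": { "field": "SubPublisherIdentifier.keyword" } } },
          { "campaign": { "terms": { "field": "Parent Campaign.keyword" } } }
        ]
      },
      "aggs": {
        "installs": { "sum": { "field": "Installs" } },
        "registrations": { "sum": { "field": "Registrations" } },
        "fds": { "sum": { "field": "FDs" } },
        "cost": { "sum": { "field": "Cost" } }
      }
    }
  }
}

I believe each response then returns an after_key that you pass back as "after" in the next request to fetch the next page.)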

Anyway, I'm a bit worried about this situation in which the Elasticsearch GC cannot just collect the old generation and recover; it requires a restart.

Is there any better way of taking care of this?
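
(One thing I'm considering, though I'm not sure it's the right fix, is lowering the request circuit breaker so a single aggregation cannot take down the whole heap, e.g.:

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "40%"
  }
}

From the docs I also understand that recent 6.x releases have a search.max_buckets soft limit on the number of buckets a single search may create, which might be the cleaner guard here.)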

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.