java.lang.OutOfMemoryError: Java heap space - GC overhead when using visualizations


(Alejandro Kapit) #1

I'm running Elasticsearch and Kibana on a clean Linode instance with 8 GB RAM.
I've only imported a CSV file through Logstash into a single Elasticsearch index. It has plain text fields, two date fields, and some numeric ones.

About the index: it has 2 shards, 425,948 docs, and its size is 283.7mb.
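(If it helps, the same stats can be checked with the cat indices API, e.g. something like the following, using the wh_rates-* index name that appears in the logs further down:

GET _cat/indices/wh_rates-*?v&h=index,pri,rep,docs.count,store.size
)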

To learn how to use Kibana, we created two visualizations. When we try to use them, everything becomes very, very slow, Kibana shows timeouts, and Elasticsearch does an impressive amount of GC work until a Java heap out-of-memory error arrives. It eats my heap immediately.

Here is an extract from my Elasticsearch logs:

[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:01:16,926][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52287] overhead, spent [11.6s] collecting in the last [11.6s]
[2018-07-05T07:01:28,246][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52288][63] duration [25.7s], collections [3]/[25.8s], total [25.7s]/[9.1m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [30.4mb]->[29.6mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:01:28,248][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52288] overhead, spent [25.7s] collecting in the last [25.8s]
[2018-07-05T07:02:20,940][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52289][69] duration [52.6s], collections [6]/[41s], total [52.6s]/[9.9m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [29.6mb]->[31mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:02:20,940][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52289] overhead, spent [52.6s] collecting in the last [41s]
[2018-07-05T07:02:43,093][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52290][72] duration [22.1s], collections [3]/[33.7s], total [22.1s]/[10.3m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [31mb]->[32.1mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:02:43,093][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52290] overhead, spent [22.1s] collecting in the last [33.7s]

...

[2018-07-05T07:03:02,210][ERROR][o.e.x.m.c.i.IndexStatsCollector] [vzbqXTQ] collector [index-stats] timed out when collecting data
[2018-07-05T07:03:39,028][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [vzbqXTQ] collector [cluster_stats] timed out when collecting data
[2018-07-05T07:03:52,852][ERROR][o.e.x.m.c.n.NodeStatsCollector] [vzbqXTQ] collector [node_stats] timed out when collecting data

...

[2018-07-05T07:20:51,870][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
	at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:182)
	at 

This is my jvm.options:

-Xms4g
-Xmx4g

-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT

Here is some real-time monitoring info:

This happens especially when trying to use a visualization whose settings are these:

[screenshot: visualization settings]

And this is its log:

 [vzbqXTQ] collector [cluster_stats] timed out when collecting data
[2018-07-05T08:08:42,220][DEBUG][o.e.a.s.TransportSearchAction] [vzbqXTQ] [wh_rates-07.04.2018][1], node[vzbqXTQNToSz2RCb5IZhTw], [P], s[STARTED], a[id=h6cXtncYQbyI6eg4InHyiA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[wh_rates-*], indicesOptions=IndicesOptions[id=39, ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false], types=[], routing='null', preference='1530777866498', requestCache=null, scroll=null, maxConcurrentShardRequests=5, batchedReduceSize=512, preFilterShardSize=64, allowPartialSearchResults=true, source={"size":0,"query":{"bool":{"must":[{"range":{"Date":{"from":1467705903138,"to":1530777903138,"include_lower":true,"include_upper":true,"format":"epoch_millis","boost":1.0}}}],"filter":[{"match_all":{"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":[],"excludes":[]},"stored_fields":"*","docvalue_fields":["@timestamp","Date","Date From Rates"],"script_fields":{},"aggregations":{"7":{"terms":{"field":"SiteId.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"1":"desc"},{"_key":"asc"}]},"aggregations":{"1":{"sum":{"field":"Installs"}},"5":{"terms":{"field":"SubPublisherIdentifier.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"4":"desc"},{"_key":"asc"}]},"aggregations":{"4":{"sum":{"field":"Cost"}},"6":{"terms":{"field":"Parent Campaign.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"4":"desc"},{"_key":"asc"}]},"aggregations":{"1":{"sum":{"field":"Installs"}},"2":{"sum":{"field":"Registrations"}},"3":{"sum":{"field":"FDs"}},"4":{"sum":{"field":"Cost"}}}}}}}}}}}] lastShard [true]
org.elasticsearch.transport.TransportException: transport stopped, action: indices:data/read/search[phase/query]

I'm pretty much lost and I'm new to ELK; can anybody help me out? Am I doing something wrong?
It doesn't look to me like a single index with ~425k documents should be too much for an 8 GB node.

Thanks a lot

Alejandro


(Mark Walkom) #2

What version are you on?


(Alejandro Kapit) #3

6.3, the current one.
By the way, I see in a diagnostics tool (Cerebro) that my 2 shards are unallocated. Is that problematic?
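In case it's useful, I believe the unassigned shards and the reason for them can be listed with something like:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
GET _cluster/allocation/explain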


(Mark Walkom) #4

Your cluster is exposed to the internet; I guessed the IP from your image above. You might want to look at that first.


(Alejandro Kapit) #5

Thanks for that, it's already blocked :blush:


(Mark Walkom) #6

Ok cool.

So my guess is that the visualisation you have created is generating too many buckets, and it's OOMing from that.
What happens if you split the 4 metrics into their own visualisations, each with the sub-buckets?


(Alejandro Kapit) #7

I disabled 3 of the 4 metrics this visualisation has, and it's still eating the whole heap.
IMO this is a fairly normal visualisation; it doesn't look like too much either.

Maybe something is wrong with the buckets? It's very strange.


(Christian Dahlqvist) #8

What sizes have you configured for your aggregations? Can you expand the view?


(Alejandro Kapit) #9

We had size set to 50000 in the 4 buckets.
Is that too much?

I actually tried reducing it to 5, but it still looks like the Elasticsearch GC cannot finish cleaning up the objects.

[vzbqXTQ] [gc][old][1026][43] duration [18.4s], collections [2]/[18.4s], total [18.4s]/[5.9m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [27.3mb]->[28.8mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}

It seems like it cannot clean the old generation properly...
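If it helps to see it from the API side, heap usage between tests can be checked with the node stats API, something like:

GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent

(filter_path just trims the response; a plain GET _nodes/stats/jvm works too.)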


(Christian Dahlqvist) #10

That will generate 50000^4 buckets, which is quite a lot. What is the cardinality of those fields?
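You can check it with the cardinality aggregation, for example something like this (field names taken from the failed request in your logs; the aggregation names are just labels, and the counts are approximate but close enough to judge the bucket explosion):

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "siteid_cardinality": { "cardinality": { "field": "SiteId.keyword" } },
    "subpub_cardinality": { "cardinality": { "field": "SubPublisherIdentifier.keyword" } },
    "campaign_cardinality": { "cardinality": { "field": "Parent Campaign.keyword" } }
  }
}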


(Alejandro Kapit) #11

What does that mean?
By the way, here are some details on the buckets:

[screenshot: bucket configuration details]


(Alejandro Kapit) #12

By the way, I just disabled the aggregations to test, and even without them it still kills the heap, and it doesn't recover...


(Christian Dahlqvist) #13

Set the size to 10 or 100 instead of 50000 and see if that makes a difference.
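In the visualisation that is the "Size" field on each terms bucket; in the query DSL it is the size parameter of the terms aggregation. A trimmed-down sketch of what one level would look like (the aggregation names here are only illustrative):

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "top_sites": {
      "terms": { "field": "SiteId.keyword", "size": 10 },
      "aggs": {
        "installs": { "sum": { "field": "Installs" } }
      }
    }
  }
}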


(Alejandro Kapit) #14

Reduced it to 50 and it still collapses:

ClusterStatsCollector] [vzbqXTQ] collector [cluster_stats] timed out when collecting data

(Alejandro Kapit) #15

How can I be sure the problem is in the visualisation and not something else?
I mean, it looks to me like the GC is not able to clean up old objects; I'm pretty confused here.
I appreciate any help.


(Alejandro Kapit) #16

Should we use the composite aggregation instead? I don't really understand the difference yet.
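From what I can tell, a composite aggregation pages through key combinations instead of trying to return every bucket at once; a rough sketch over the same fields (source and aggregation names here are only illustrative) might look like:

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "by_keys": {
      "composite": {
        "size": 1000,
        "sources": [
          { "site": { "terms": { "field": "SiteId.keyword" } } },
          { "subpub": { "terms": { "field": "SubPublisherIdentifier.keyword" } } },
          { "campaign": { "terms": { "field": "Parent Campaign.keyword" } } }
        ]
      },
      "aggs": {
        "installs": { "sum": { "field": "Installs" } },
        "cost": { "sum": { "field": "Cost" } }
      }
    }
  }
}

The next page is then requested by passing the last returned key back as "after" inside the composite block.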

Anyway, I'm a bit worried about this situation in which the Elasticsearch GC cannot simply collect the old objects and recover; it requires a restart.

Is there any better way of dealing with this?


(system) #18

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.