java.lang.OutOfMemoryError: Java heap space - GC overhead when using visualizations


(Alejandro Kapit) #1

I'm running Elasticsearch and Kibana on a clean Linode instance with 8 GB RAM.
I've only imported a CSV file through Logstash into a single Elasticsearch index. It has plain text fields, two date fields, and some numeric ones.

About the index: it has 2 shards, 425,948 docs, and its size is 283.7mb.
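(If it helps, the same stats can be checked with the cat indices API, e.g. something like the following, using the wh_rates-* index name that appears in the logs further down:

GET _cat/indices/wh_rates-*?v&h=index,pri,rep,docs.count,store.size
)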

To learn how to use Kibana, we created two visualizations. When we try to use them, everything becomes very, very slow, Kibana shows timeouts, and Elasticsearch does an impressive amount of GC work until a Java heap out-of-memory error arrives. It eats my heap immediately.

Here is an extract from my Elasticsearch logs:

[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:01:16,926][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52287] overhead, spent [11.6s] collecting in the last [11.6s]
[2018-07-05T07:01:28,246][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52288][63] duration [25.7s], collections [3]/[25.8s], total [25.7s]/[9.1m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [30.4mb]->[29.6mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:01:28,248][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52288] overhead, spent [25.7s] collecting in the last [25.8s]
[2018-07-05T07:02:20,940][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52289][69] duration [52.6s], collections [6]/[41s], total [52.6s]/[9.9m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [29.6mb]->[31mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:02:20,940][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52289] overhead, spent [52.6s] collecting in the last [41s]
[2018-07-05T07:02:43,093][INFO ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][old][52290][72] duration [22.1s], collections [3]/[33.7s], total [22.1s]/[10.3m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [31mb]->[32.1mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2018-07-05T07:02:43,093][WARN ][o.e.m.j.JvmGcMonitorService] [vzbqXTQ] [gc][52290] overhead, spent [22.1s] collecting in the last [33.7s]

...

[2018-07-05T07:03:02,210][ERROR][o.e.x.m.c.i.IndexStatsCollector] [vzbqXTQ] collector [index-stats] timed out when collecting data
[2018-07-05T07:03:39,028][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [vzbqXTQ] collector [cluster_stats] timed out when collecting data
[2018-07-05T07:03:52,852][ERROR][o.e.x.m.c.n.NodeStatsCollector] [vzbqXTQ] collector [node_stats] timed out when collecting data

...

[2018-07-05T07:20:51,870][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
	at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:182)
	at 

This is my jvm.options:

-Xms4g
-Xmx4g

-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT

Here is some real-time monitoring info:

This happens especially when trying to use a visualization whose settings are these:

[screenshot: visualization settings]

And this is its log:

 [vzbqXTQ] collector [cluster_stats] timed out when collecting data
[2018-07-05T08:08:42,220][DEBUG][o.e.a.s.TransportSearchAction] [vzbqXTQ] [wh_rates-07.04.2018][1], node[vzbqXTQNToSz2RCb5IZhTw], [P], s[STARTED], a[id=h6cXtncYQbyI6eg4InHyiA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[wh_rates-*], indicesOptions=IndicesOptions[id=39, ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false], types=[], routing='null', preference='1530777866498', requestCache=null, scroll=null, maxConcurrentShardRequests=5, batchedReduceSize=512, preFilterShardSize=64, allowPartialSearchResults=true, source={"size":0,"query":{"bool":{"must":[{"range":{"Date":{"from":1467705903138,"to":1530777903138,"include_lower":true,"include_upper":true,"format":"epoch_millis","boost":1.0}}}],"filter":[{"match_all":{"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":[],"excludes":[]},"stored_fields":"*","docvalue_fields":["@timestamp","Date","Date From Rates"],"script_fields":{},"aggregations":{"7":{"terms":{"field":"SiteId.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"1":"desc"},{"_key":"asc"}]},"aggregations":{"1":{"sum":{"field":"Installs"}},"5":{"terms":{"field":"SubPublisherIdentifier.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"4":"desc"},{"_key":"asc"}]},"aggregations":{"4":{"sum":{"field":"Cost"}},"6":{"terms":{"field":"Parent Campaign.keyword","size":50000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"4":"desc"},{"_key":"asc"}]},"aggregations":{"1":{"sum":{"field":"Installs"}},"2":{"sum":{"field":"Registrations"}},"3":{"sum":{"field":"FDs"}},"4":{"sum":{"field":"Cost"}}}}}}}}}}}] lastShard [true]
org.elasticsearch.transport.TransportException: transport stopped, action: indices:data/read/search[phase/query]

I'm pretty much lost and I'm new to ELK; can anybody help me out? Am I doing something wrong?
It doesn't look to me like a single index with ~425k documents should be too much for an 8 GB node.

Thanks a lot

Alejandro


(Mark Walkom) #2

What version are you on?


(Alejandro Kapit) #3

6.3, the current one.
By the way, I see in a diagnostics tool (Cerebro) that my 2 shards are unallocated. Is that problematic?
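In case it's useful, I believe the unassigned shards and the reason for them can be listed with something like:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
GET _cluster/allocation/explain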


(Mark Walkom) #4

Your cluster is exposed to the internet; I guessed the IP from your image above. You might want to look at that first.


(Alejandro Kapit) #5

Thanks for that, it's already blocked :blush:


(Mark Walkom) #6

Ok cool.

So my guess is that the visualisation you have created is generating too many buckets, and it's OOMing from that.
What happens if you split the 4 metrics into their own visualisations, each with the sub-buckets?


(Alejandro Kapit) #7

I disabled 3 of the 4 metrics this visualisation has, and it's still eating the whole heap.
IMO this is a fairly normal visualisation; it doesn't look like too much either.

Maybe something is wrong with the buckets? It's very strange.


(Christian Dahlqvist) #8

What sizes have you configured for your aggregations? Can you expand the view?


(Alejandro Kapit) #9

We had size set to 50000 in the 4 buckets.
Is that too much?

I actually tried reducing it to 5, but it still looks like the Elasticsearch GC cannot finish cleaning up the objects.

[vzbqXTQ] [gc][old][1026][43] duration [18.4s], collections [2]/[18.4s], total [18.4s]/[5.9m], memory [3.9gb]->[3.9gb]/[3.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [27.3mb]->[28.8mb]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}

It seems like it cannot clean the old generation properly...
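If it helps to see it from the API side, heap usage between tests can be checked with the node stats API, something like:

GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent

(filter_path just trims the response; a plain GET _nodes/stats/jvm works too.)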


(Christian Dahlqvist) #10

That will generate 50000^4 buckets, which is quite a lot. What is the cardinality of those fields?
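You can check it with the cardinality aggregation, for example something like this (field names taken from the failed request in your logs; the aggregation names are just labels, and the counts are approximate but close enough to judge the bucket explosion):

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "siteid_cardinality": { "cardinality": { "field": "SiteId.keyword" } },
    "subpub_cardinality": { "cardinality": { "field": "SubPublisherIdentifier.keyword" } },
    "campaign_cardinality": { "cardinality": { "field": "Parent Campaign.keyword" } }
  }
}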


(Alejandro Kapit) #11

What does that mean?
By the way, here are some details on the buckets:

[screenshot: bucket configuration details]


(Alejandro Kapit) #12

By the way, I just disabled the aggregations to test, and even without them it still kills the heap, and it doesn't recover...


(Christian Dahlqvist) #13

Set the size to 10 or 100 instead of 50000 and see if that makes a difference.
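In the visualisation that is the "Size" field on each terms bucket; in the query DSL it is the size parameter of the terms aggregation. A trimmed-down sketch of what one level would look like (the aggregation names here are only illustrative):

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "top_sites": {
      "terms": { "field": "SiteId.keyword", "size": 10 },
      "aggs": {
        "installs": { "sum": { "field": "Installs" } }
      }
    }
  }
}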


(Alejandro Kapit) #14

Reduced it to 50 and it still collapses:

ClusterStatsCollector] [vzbqXTQ] collector [cluster_stats] timed out when collecting data

(Alejandro Kapit) #15

How can I be sure the problem is in the visualisation and not something else?
I mean, it looks to me like the GC is not able to clean up old objects; I'm pretty confused here.
I appreciate any help.


(Alejandro Kapit) #16

Should we use the composite aggregation instead? I don't really understand the difference yet.
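From what I can tell, a composite aggregation pages through key combinations instead of trying to return every bucket at once; a rough sketch over the same fields (source and aggregation names here are only illustrative) might look like:

GET wh_rates-*/_search
{
  "size": 0,
  "aggs": {
    "by_keys": {
      "composite": {
        "size": 1000,
        "sources": [
          { "site": { "terms": { "field": "SiteId.keyword" } } },
          { "subpub": { "terms": { "field": "SubPublisherIdentifier.keyword" } } },
          { "campaign": { "terms": { "field": "Parent Campaign.keyword" } } }
        ]
      },
      "aggs": {
        "installs": { "sum": { "field": "Installs" } },
        "cost": { "sum": { "field": "Cost" } }
      }
    }
  }
}

The next page is then requested by passing the last returned key back as "after" inside the composite block.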

Anyway, I'm a bit worried about this situation in which the Elasticsearch GC cannot simply collect the old objects and recover; it requires a restart.

Is there any better way of dealing with this?


(system) #18

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.