Endless GC: need advice on GC best practices for operations

Hello,

I have a 12-node Elasticsearch 6.0.0 cluster ingesting about 10k events/sec. From time to time, I think because of some Kibana request, the cluster stops answering Kibana and even local curls.

The logs are pretty clear:

[2018-02-22T16:36:43,573][WARN ][o.e.m.j.JvmGcMonitorService] [serverA] [gc][old][1748945][386] duration [33.3s], collections [1]/[33.6s], total [33.3s]/[31m], memory [29.7gb]->[29.7gb]/[29.7gb], all_pools {[young] [1.8gb]->[1.8gb]/[1.8gb]}{[survivor] [181.2mb]->[185.3mb]/[232.9mb]}{[old] [27.7gb]->[27.7gb]/[27.7gb]}
[2018-02-22T16:36:43,581][WARN ][o.e.m.j.JvmGcMonitorService] [serverA] [gc][1748945] overhead, spent [33.3s] collecting in the last [33.6s]

Is garbage collecting for 33.3s in the last 33.6s a bad thing? :slight_smile:

An extract of the JVM options:

-Xms30g
-Xmx30g
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-Xss1m

After this endless GC, the nodes do not answer pings and the cluster stays unresponsive until the nodes are restarted. Any advice on how to prevent this, or on how to recover without restarts, would be welcome.

Do you have monitoring installed? Do you have a graph showing how heap usage varies over time?
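If nothing is installed, a rough way to watch per-node heap over time is the cat nodes API (assuming curl access to a node on the default port 9200; the column names below are what I recall from the 6.x _cat docs):

watch -n 30 "curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'"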

Thanks for your answer. I don't have access to that kind of monitoring, but I can tell that the cluster can operate for weeks before this happens. Every time I have had this problem on a smaller cluster, it was because of an oversized request. I can't prevent users from making big requests; I just want those requests to fail rather than have catastrophic consequences for the cluster.

It would be useful to see how heap is being used and how much headroom you have for large queries. The cluster stats API could provide some of this information. Do you have any non-standard configuration around memory and/or circuit breakers?
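For example (assuming curl access to a node on the default port), the heap figures are in the JVM sections of the cluster stats and node stats responses:

# Cluster-wide JVM summary (heap used vs. max across the cluster)
curl -s 'localhost:9200/_cluster/stats?pretty&filter_path=nodes.jvm'
# Per-node heap usage, pool sizes and GC counters
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'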

I restarted the cluster, so I don't know whether the cluster stats will help you. However, I plan to collect regular statistics.

I didn't know about the circuit breakers, so I guess none are configured. I'll study this page immediately.

Cluster stats will give an indication of baseline heap usage, so they might give some insight. Not as good as a graph, though...

Thanks to your advice, I set the total circuit breaker to 45% of the heap. I'll test in a few hours whether it works. Which brings me to the next point: I need more heap space.
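In case it helps someone else: the parent circuit breaker limit is, as far as I know, a dynamic cluster setting, so it can be applied without a restart (the 45% value is my own choice, not an official recommendation):

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "indices.breaker.total.limit": "45%"
  }
}'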

I understand Java compressed object pointers, and my Xmx is set to 30GB. However, my nodes have 256GB of memory. Is it safe to set Xmx to 80GB or so?

Going beyond ~30GB is, as far as I recall, still not recommended, since above that threshold the JVM can no longer use compressed object pointers. What you can do to make better use of the heap on large hosts, however, is to run multiple nodes per host. A 256GB host should be able to handle 4 nodes, each with a 30GB heap.
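If you want to confirm that a given heap size still benefits from compressed object pointers, Elasticsearch logs it at startup ("heap size [...], compressed ordinary object pointers [true]"), and as far as I recall the nodes info API also exposes a field for it, so something like this should show it (field name from memory, so treat it as an assumption):

# Look for the compressed-oops flag in each node's JVM info
curl -s 'localhost:9200/_nodes/jvm?pretty' | grep -i compressed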

Unfortunately I'm not able to make that kind of modification on the production cluster. Plus, to be efficient I would have to double the number of shards for each index (and thus halve their size); otherwise the same request would saturate a single JVM again, wouldn't it?
