Endless GC: need advice on GC best practices for operations

Hello,

I have a 12-node Elasticsearch 6.0.0 cluster ingesting about 10k events/sec. From time to time, I think because of some Kibana request, the cluster stops answering Kibana and even local curls.

The logs are pretty clear:

[2018-02-22T16:36:43,573][WARN ][o.e.m.j.JvmGcMonitorService] [serverA] [gc][old][1748945][386] duration [33.3s], collections [1]/[33.6s], total [33.3s]/[31m], memory [29.7gb]->[29.7gb]/[29.7gb], all_pools {[young] [1.8gb]->[1.8gb]/[1.8gb]}{[survivor] [181.2mb]->[185.3mb]/[232.9mb]}{[old] [27.7gb]->[27.7gb]/[27.7gb]}
[2018-02-22T16:36:43,581][WARN ][o.e.m.j.JvmGcMonitorService] [serverA] [gc][1748945] overhead, spent [33.3s] collecting in the last [33.6s]

Is garbage collecting for 33.3s in the last 33.6s a bad thing? :slight_smile:

An extract of the JVM options:

-Xms30g
-Xmx30g
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-Xss1m

After this endless GC, the nodes do not answer pings and the cluster stays unresponsive until the nodes are restarted. Any advice on how to prevent this, or on how to recover without restarts, would be welcome.

Do you have monitoring installed? Do you have a graph showing how heap usage varies over time?
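If nothing is installed, a rough way to watch per-node heap over time is the cat nodes API (assuming curl access to a node on the default port 9200; the column names below are what I recall from the 6.x _cat docs):

watch -n 30 "curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'"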

Thanks for your answer. I don't have access to that kind of monitoring, but I can tell that the cluster can operate for weeks before this happens. Every time I have had this problem on a smaller cluster, it was because of an oversized request. I can't prevent users from making big requests; I just want those requests to fail rather than have catastrophic consequences for the cluster.

It would be useful to see how heap is being used and how much headroom you have for large queries. The cluster stats API could provide some of this information. Do you have any non-standard configuration around memory and/or circuit breakers?
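For example (assuming curl access to a node on the default port), the heap figures are in the JVM sections of the cluster stats and node stats responses:

# Cluster-wide JVM summary (heap used vs. max across the cluster)
curl -s 'localhost:9200/_cluster/stats?pretty&filter_path=nodes.jvm'
# Per-node heap usage, pool sizes and GC counters
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'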

I restarted the cluster, so I don't know whether the cluster stats will help you. However, I plan to collect regular statistics.

I didn't know about the circuit breakers, so I guess none are configured. I'll study this page immediately.

Cluster stats will give an indication of baseline heap usage, so they might give some insight. Not as good as a graph, though...

Thanks to your advice, I set the total circuit breaker to 45% of the heap. I'll test in a few hours whether it works. Which brings me to the next point: I need more heap space.
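In case it helps someone else: the parent circuit breaker limit is, as far as I know, a dynamic cluster setting, so it can be applied without a restart (the 45% value is my own choice, not an official recommendation):

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "indices.breaker.total.limit": "45%"
  }
}'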

I understand Java compressed object pointers, and my Xmx is set to 30GB. However, my nodes have 256GB of memory. Is it safe to set Xmx to 80GB or so?

Going beyond ~30GB is, as far as I recall, still not recommended, since above that threshold the JVM can no longer use compressed object pointers. What you can do to make better use of the heap on large hosts, however, is to run multiple nodes per host. A 256GB host should be able to handle 4 nodes, each with a 30GB heap.
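If you want to confirm that a given heap size still benefits from compressed object pointers, Elasticsearch logs it at startup ("heap size [...], compressed ordinary object pointers [true]"), and as far as I recall the nodes info API also exposes a field for it, so something like this should show it (field name from memory, so treat it as an assumption):

# Look for the compressed-oops flag in each node's JVM info
curl -s 'localhost:9200/_nodes/jvm?pretty' | grep -i compressed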

Unfortunately I'm not able to make that kind of modification on the production cluster. Plus, to be efficient I would have to double the number of shards for each index (and thus halve their size); otherwise the same request would saturate a single JVM again, wouldn't it?
