I have a 12-node Elasticsearch 6.0.0 cluster ingesting about 10k events/sec. From time to time, I think because of some Kibana request, the cluster stops answering Kibana and even local curls.
The logs are pretty clear:
    [2018-02-22T16:36:43,573][WARN ][o.e.m.j.JvmGcMonitorService] [serverA] [gc][old][1748945][386] duration [33.3s], collections [1]/[33.6s], total [33.3s]/[31m], memory [29.7gb]->[29.7gb]/[29.7gb], all_pools {[young] [1.8gb]->[1.8gb]/[1.8gb]}{[survivor] [181.2mb]->[185.3mb]/[232.9mb]}{[old] [27.7gb]->[27.7gb]/[27.7gb]}
    [2018-02-22T16:36:43,581][WARN ][o.e.m.j.JvmGcMonitorService] [serverA] [gc][1748945] overhead, spent [33.3s] collecting in the last [33.6s]
Is garbage collecting for 33.3s in the last 33.6s a bad thing?
After this endless GC, the nodes stop answering pings and the cluster becomes unresponsive until the nodes are restarted (note the memory figures above: the heap is at 29.7gb of 29.7gb before and after the collection, so the GC reclaims nothing). Any advice on how to prevent this, or on how to recover without restarts, would be welcome.
Thanks for your answer. I don't have access to that kind of monitoring, but I can tell you the cluster can run for weeks before this happens. Every time I've had this problem on a smaller cluster, it was caused by a too-big request. I can't prevent users from making big requests; I just want those requests to fail rather than have catastrophic consequences for the cluster.
It would be useful to see how heap is being used and how much headroom you have for large queries. The cluster stats API could provide some of this information. Do you have any non-standard configuration around memory and/or circuit breakers?
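If you don't have monitoring in place, a couple of one-off calls can give a rough picture. Something like this (assuming you can curl a node on the default port 9200) shows per-node heap usage and the cluster-wide JVM totals:

    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'
    curl -s 'localhost:9200/_cluster/stats?filter_path=nodes.jvm'

If heap.percent already sits around 75% in steady state, there is very little headroom left before a large aggregation pushes the old generation into the kind of full GCs you are seeing.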
Thanks to your advice, I set the total circuit breaker limit to 45% of heap. I'll know in a few hours whether it works.
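For reference, this is roughly the call I used; the 45% figure is just a first guess at a safe limit:

    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "persistent": { "indices.breaker.total.limit": "45%" }
    }'

Which brings me to the next point: I need more heap space.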
I understand Java pointer compression, and my Xmx is set to 30GB. However, my nodes have 256GB of memory. Is it safe to raise Xmx to 80GB or so?
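(For what it's worth, you can ask the JVM directly whether a given heap size still gets compressed oops; this is a plain HotSpot flag dump, nothing Elasticsearch-specific:

    java -Xmx80g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

At 80GB it reports false, i.e. the JVM falls back to full 64-bit pointers, which is part of why I'm asking.)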
Going beyond 30GB is, as far as I can recall, still not recommended. What you can do to make better use of the memory on large hosts, however, is run multiple nodes per host: a 256GB host should be able to handle 4 nodes, each with a 30GB heap, while still leaving memory for the filesystem cache. See the sketch below.
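A minimal sketch of one instance's config under 6.x, assuming each instance gets its own data path and ports (the names and paths below are made up):

    # jvm.options, identical for every instance
    -Xms30g
    -Xmx30g

    # elasticsearch.yml for instance 1 (repeat with distinct names, paths and ports)
    node.name: serverA-1
    path.data: /data/es-1
    http.port: 9201
    # keep a primary and its replica off the same physical host
    cluster.routing.allocation.same_shard.host: true

If instances ever share a data directory, 6.x also needs node.max_local_storage_nodes raised accordingly.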
Unfortunately I'm not able to make that kind of modification on the production cluster. Plus, for it to be effective I would have to double the number of shards for each index (and thus halve their size); otherwise the same request would saturate a single JVM again, wouldn't it?