GC time rises over time in Elasticsearch 2.1.1

We are running a 3-node ES cluster. Each node has 16 GB of memory, and ES_HEAP is set to 8G.
About 30 hours after the cluster starts, GC time rises from 5 ms to 200 ms and CPU utilization climbs from 10% to 50%, which causes many requests to time out. So we have to restart all nodes every day.

[Graph: ES_HEAP]

[Graph: GC time cost (milliseconds)]

After reading https://www.elastic.co/blog/a-heap-of-trouble, we set ES_HEAP=3G, but this did not solve the problem.

Queries per second: 20 function score queries and 200 mget queries.
I found that the filter cache count in our cluster is small, only about 800.
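
In case it is useful, this kind of number (GC times, cache usage) can be pulled from the node stats API; a sketch, assuming the default HTTP port:

  # jvm.gc.collectors.* shows young/old collection counts and times,
  # the indices section shows query cache and fielddata sizes
  curl -s 'localhost:9200/_nodes/stats/jvm,indices?pretty'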

Any help would be greatly appreciated.

Thanks.

P.S.

  1. When I set ES_HEAP_NEWSIZE=1g, GC time is under control, but CPU usage still grows over time.
  2. If I rebuild a new index and point the alias at the new index, then heap size, GC time, and CPU usage all cool down, the same as after a restart (see the alias sketch after this list).
  3. I upgraded to 2.3.5 and still have the same problem.
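
For reference, the alias swap mentioned in point 2 is done roughly like this (a sketch; old_index, new_index, and my_alias are placeholder names):

  curl -XPOST 'localhost:9200/_aliases' -d '{
    "actions": [
      { "remove": { "index": "old_index", "alias": "my_alias" } },
      { "add":    { "index": "new_index", "alias": "my_alias" } }
    ]
  }'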


Here is the hot_threads output:
::: {es01}{yweCHrNLQMaR68zVrKECnQ}{10.10.20.97}{10.10.20.97:9300}
Hot threads at 2016-08-29T14:31:42.671Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

17.1% (85.5ms out of 500ms) cpu usage by thread 'elasticsearch[es01][search][T#4]'
3/10 snapshots sharing following 18 elements
java.util.HashMap.hash(HashMap.java:338)
java.util.HashMap.putMapEntries(HashMap.java:514)
java.util.HashMap.<init>(HashMap.java:489)
com.google.common.collect.Maps.newHashMap(Maps.java:230)
org.elasticsearch.common.collect.MapBuilder.<init>(MapBuilder.java:48)
org.elasticsearch.common.collect.MapBuilder.newMapBuilder(MapBuilder.java:38)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
5/10 snapshots sharing following 13 elements
org.elasticsearch.common.collect.MapBuilder.immutableMap(MapBuilder.java:88)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
unique snapshot
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1265)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:161)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

::: {es03}{vgW7IYFPTLWlcuFtANyiAg}{10.10.57.204}{10.10.57.204:9300}
Hot threads at 2016-08-29T14:31:42.678Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

23.2% (115.8ms out of 500ms) cpu usage by thread 'elasticsearch[es03][search][T#2]'
2/10 snapshots sharing following 16 elements
java.util.HashMap.<init>(HashMap.java:489)
com.google.common.collect.Maps.newHashMap(Maps.java:230)
org.elasticsearch.common.collect.MapBuilder.<init>(MapBuilder.java:48)
org.elasticsearch.common.collect.MapBuilder.newMapBuilder(MapBuilder.java:38)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 16 elements
com.google.common.collect.RegularImmutableMap.checkNoConflictInBucket(RegularImmutableMap.java:104)
com.google.common.collect.RegularImmutableMap.<init>(RegularImmutableMap.java:97)
com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:294)
org.elasticsearch.common.collect.MapBuilder.immutableMap(MapBuilder.java:88)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1265)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:161)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

::: {es02}{0CSmzo0HSDqz_ViQItwpew}{10.10.20.37}{10.10.20.37:9300}
Hot threads at 2016-08-29T14:31:42.678Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

24.2% (121ms out of 500ms) cpu usage by thread 'elasticsearch[es02][search][T#7]'
9/10 snapshots sharing following 12 elements
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:153)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

22.5% (112.4ms out of 500ms) cpu usage by thread 'elasticsearch[es02][search][T#5]'
4/10 snapshots sharing following 18 elements
java.util.HashMap.hash(HashMap.java:338)
java.util.HashMap.putMapEntries(HashMap.java:514)
java.util.HashMap.<init>(HashMap.java:489)
com.google.common.collect.Maps.newHashMap(Maps.java:230)
org.elasticsearch.common.collect.MapBuilder.<init>(MapBuilder.java:48)
org.elasticsearch.common.collect.MapBuilder.newMapBuilder(MapBuilder.java:38)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
5/10 snapshots sharing following 13 elements
org.elasticsearch.common.collect.MapBuilder.immutableMap(MapBuilder.java:88)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Hi @zaka,

I don't know anything about your application, but from the graphs I conclude it's only used during business hours (around 9am - 5pm).

You wrote:

  "When I set ES_HEAP_NEWSIZE=1g, GC time is under control, but CPU usage still grows over time."

I'd explain this by the fact that a smaller heap is faster to collect (-> shorter GC cycles) but the garbage collector has to run more often (-> more CPU usage).

You basically have two options:

  • Reduce memory consumption by changing cache settings and revisiting mappings (move things off heap and use doc values where possible). You can e.g. take a heap dump, inspect the top consumers, and try to reduce them by changing the appropriate settings (see the jmap sketch after this list).
  • Configure the garbage collector settings so they better fit your application's allocation pattern. You can start by enabling the JVM GC logs, analyzing them (e.g. with GCViewer), and then carefully tuning the garbage collector.
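
For the heap-dump route, a minimal sketch (jmap ships with the JDK; run it as the same user as the Elasticsearch process and make sure there is enough disk space for the dump file):

  # quick class histogram without a full dump
  jmap -histo <elasticsearch-pid> | head -n 30

  # full dump of live objects (note: this triggers a full GC)
  jmap -dump:live,format=b,file=/tmp/es-heap.hprof <elasticsearch-pid>

The resulting .hprof file can then be opened in a heap analyzer such as Eclipse MAT to find the top consumers.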

One issue you need to be aware of is that CMS is a non-compacting collector which means that you will hit a full GC at some point due to heap fragmentation (but you should see a full GC very rarely).

Daniel

Thanks @danielmitterdorfer.
In the past my ES_HEAP_NEWSIZE was smaller than 1G; after increasing it, the frequent full GC problem is under control, but CPU usage is still very high.
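
For reference, this is how I understand the heap settings map to JVM flags with the stock 2.x start scripts (a sketch; the values are just examples):

  export ES_HEAP_SIZE=3g       # becomes -Xms3g -Xmx3g
  export ES_HEAP_NEWSIZE=1g    # becomes -Xmn1g (young generation size)
  bin/elasticsearch -d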

Every query fetches 700 docs from ES; we then rerank them, cache all 700 doc_ids in Redis, and return the top 10 docs to the user. If the user wants to read more, we get the next 10 doc_ids from Redis and mget those docs from ES by doc_id.
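
The paging step is a plain multi-get by IDs; a sketch of such a request (my_index, my_type, and the IDs are placeholders):

  curl -XGET 'localhost:9200/my_index/my_type/_mget' -d '{
    "ids": ["doc_id_11", "doc_id_12", "doc_id_13"]
  }'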

Maybe this query pattern causes ES to frequently allocate huge objects and produces more and more heap fragmentation. I tried setting -XX:+UseCMSCompactAtFullCollection to compact the fragments, but this flag is deprecated as of Java 8.

I appreciate your advice.

Hi @zaka,

Why do you cache this in Redis? Is there a reason why you didn't use scroll?

As I said: totally understandable, I expect that you have way more GC events, they're just much shorter.

First, this flag seems pretty esoteric, and second, it is on by default in Java 8, so setting it explicitly is ineffective (i.e. it did not change anything). I can only reiterate what I suggested previously:

  1. Enable the GC log by (at least) specifying: -Xloggc:elasticsearch-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps (see the sketch after this list for one way to pass these flags)
  2. Restart the node(s)
  3. Grab the GC log after some time, look at it, for example with GCViewer
  4. Draw conclusions based on the data you see and change the minimal set of GC JVM options.
  5. Iterate until you're satisfied with the GC behavior.
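
For reference, one way to pass those GC logging flags on 2.x is the ES_JAVA_OPTS environment variable (a sketch, assuming the stock start scripts; the log path is just an example):

  export ES_JAVA_OPTS="-Xloggc:/var/log/elasticsearch/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
  bin/elasticsearch -d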

This process will take some time and it is specific to your usage pattern of Elasticsearch.

I can give you just one specific recommendation: Stick with the CMS garbage collector and stay away from G1GC, it is known to cause data corruption with Elasticsearch.
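
If you want to double-check which collector a node is actually running with before and after any tuning, you can inspect the JVM arguments it was started with, e.g. on the host itself:

  # look for -XX:+UseConcMarkSweepGC in the output
  jps -lvm | grep -i elasticsearch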

Daniel

Why are there so many byte[] objects on the heap? I have no idea.

Output of jmap -histo (top 20 entries):
num #instances #bytes class name

1: 873753 538406624 [B
2: 8540639 273300448 java.util.HashMap$Node
3: 1497478 204082112 [C
4: 6325279 202408928 com.twitter.jsr166e.LongAdder
5: 6149502 147588048 com.google.common.collect.ImmutableMapEntry$TerminalEntry
6: 70613 127745104 [I
7: 246 84082496 [Lcom.google.common.collect.ImmutableMapEntry;
8: 119509 55424216 [Ljava.util.HashMap$Node;
9: 2285593 54854232 com.google.common.collect.RegularImmutableMap$NonTerminalMapEntry
10: 2108408 50601792 org.elasticsearch.common.metrics.MeanMetric
11: 2108463 33735408 org.elasticsearch.common.metrics.CounterMetric
12: 1395236 33485664 java.lang.String
13: 24 30924392 [Ljava.util.Map$Entry;
14: 702794 28111760 org.elasticsearch.index.search.stats.ShardSearchStats$StatsHolder
15: 96159 25635448 [J
16: 295196 21254112 org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
17: 421789 13497248 org.wltea.analyzer.dic.DictSegment
18: 244354 11728992 java.nio.HeapByteBuffer
19: 340474 10895168 java.util.concurrent.ConcurrentHashMap$Node
20: 318179 10181728 org.apache.lucene.index.SingletonSortedNumericDocValues

We need to rerank the docs (complicated business requirements), so we can't use scroll.