We are running a 3-node ES cluster. Each node has 16 GB of memory, and ES_HEAP_SIZE is set to 8 GB.
About 30 hours after the cluster starts, GC time rises from 5 ms to 200 ms and CPU utilization climbs from 10% to 50%, which causes many requests to time out. So we have to restart all nodes every day.
Here is the hot_threads output:
::: {es01}{yweCHrNLQMaR68zVrKECnQ}{10.10.20.97}{10.10.20.97:9300}
Hot threads at 2016-08-29T14:31:42.671Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
17.1% (85.5ms out of 500ms) cpu usage by thread 'elasticsearch[es01][search][T#4]'
3/10 snapshots sharing following 18 elements
java.util.HashMap.hash(HashMap.java:338)
java.util.HashMap.putMapEntries(HashMap.java:514)
java.util.HashMap.&lt;init&gt;(HashMap.java:489)
com.google.common.collect.Maps.newHashMap(Maps.java:230)
org.elasticsearch.common.collect.MapBuilder.&lt;init&gt;(MapBuilder.java:48)
org.elasticsearch.common.collect.MapBuilder.newMapBuilder(MapBuilder.java:38)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
5/10 snapshots sharing following 13 elements
org.elasticsearch.common.collect.MapBuilder.immutableMap(MapBuilder.java:88)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
unique snapshot
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1265)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:161)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
::: {es03}{vgW7IYFPTLWlcuFtANyiAg}{10.10.57.204}{10.10.57.204:9300}
Hot threads at 2016-08-29T14:31:42.678Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
23.2% (115.8ms out of 500ms) cpu usage by thread 'elasticsearch[es03][search][T#2]'
2/10 snapshots sharing following 16 elements
java.util.HashMap.&lt;init&gt;(HashMap.java:489)
com.google.common.collect.Maps.newHashMap(Maps.java:230)
org.elasticsearch.common.collect.MapBuilder.&lt;init&gt;(MapBuilder.java:48)
org.elasticsearch.common.collect.MapBuilder.newMapBuilder(MapBuilder.java:38)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 16 elements
com.google.common.collect.RegularImmutableMap.checkNoConflictInBucket(RegularImmutableMap.java:104)
com.google.common.collect.RegularImmutableMap.&lt;init&gt;(RegularImmutableMap.java:97)
com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:294)
org.elasticsearch.common.collect.MapBuilder.immutableMap(MapBuilder.java:88)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1265)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:161)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
::: {es02}{0CSmzo0HSDqz_ViQItwpew}{10.10.20.37}{10.10.20.37:9300}
Hot threads at 2016-08-29T14:31:42.678Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
24.2% (121ms out of 500ms) cpu usage by thread 'elasticsearch[es02][search][T#7]'
9/10 snapshots sharing following 12 elements
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:153)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
22.5% (112.4ms out of 500ms) cpu usage by thread 'elasticsearch[es02][search][T#5]'
4/10 snapshots sharing following 18 elements
java.util.HashMap.hash(HashMap.java:338)
java.util.HashMap.putMapEntries(HashMap.java:514)
java.util.HashMap.&lt;init&gt;(HashMap.java:489)
com.google.common.collect.Maps.newHashMap(Maps.java:230)
org.elasticsearch.common.collect.MapBuilder.&lt;init&gt;(MapBuilder.java:48)
org.elasticsearch.common.collect.MapBuilder.newMapBuilder(MapBuilder.java:38)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
5/10 snapshots sharing following 13 elements
org.elasticsearch.common.collect.MapBuilder.immutableMap(MapBuilder.java:88)
org.elasticsearch.index.search.stats.ShardSearchStats.groupStats(ShardSearchStats.java:157)
org.elasticsearch.index.search.stats.ShardSearchStats.onPreQueryPhase(ShardSearchStats.java:76)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:374)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
I don't know anything about your application, but from the graphs I conclude it's only used during business hours (around 9am - 5pm).

[quote="zaka, post:2, topic:58526"]
When I set ES_HEAP_NEWSIZE=1g, GC time was brought under control, but CPU usage still grows over time.
[/quote]
I'd explain this by the fact that a smaller heap is faster to collect (-> shorter GC cycles) but the garbage collector has to run more often (-> more CPU usage).
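For reference, on ES 2.x both sizes are usually set through environment variables that bin/elasticsearch.in.sh picks up before starting the JVM. A minimal sketch, assuming a tarball install started via bin/elasticsearch (values are just an illustration):

export ES_HEAP_SIZE=8g      # becomes -Xms8g -Xmx8g
export ES_HEAP_NEWSIZE=1g   # becomes -Xmn1g (young generation)
bin/elasticsearch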
You basically have two options:
1. Reduce memory consumption by changing cache settings and revisiting mappings (move things off heap and use doc values where possible; see the mapping sketch below). You can, for example, take a heap dump, inspect the top consumers, and try to reduce them by changing the appropriate settings.
2. Configure the garbage collector so it better fits your application's allocation pattern. You can start by enabling JVM GC logs, analyzing them (e.g. with GCViewer), and then carefully tuning the garbage collector.
One issue you need to be aware of is that CMS is a non-compacting collector which means that you will hit a full GC at some point due to heap fragmentation (but you should see a full GC very rarely).
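To illustrate the first option: on ES 2.x doc values are already the default for not_analyzed string and numeric fields, but for a field where they were disabled you could turn them back on when (re)creating the index, since doc_values cannot be changed on an existing field. A rough sketch with made-up index and field names:

curl -XPUT 'http://localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "category": { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}'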
Thanks @danielmitterdorfer.
Previously my ES_HEAP_NEWSIZE was smaller than 1 GB; after increasing it, the frequent full GC problem was brought under control, but CPU usage is still very high.
Every query fetches 700 docs from ES, reranks them, caches all 700 doc_ids in Redis, and returns the top 10 docs to the user. If the user wants to read more, we take the next 10 doc_ids from Redis and mget the docs from ES by doc_id.
Maybe this query pattern makes ES frequently allocate huge objects and produces more and more heap fragmentation. I tried setting -XX:+UseCMSCompactAtFullCollection to compact the fragments, but this flag is deprecated as of Java 8.
Why do you cache this in Redis? Is there a reason why you didn't use scroll?
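For comparison, a scroll keeps the cursor on the Elasticsearch side, so the 700 hits could be paged through without an external cache. A minimal sketch against ES 2.x, with the index name and query as placeholders:

curl -XGET 'http://localhost:9200/my_index/_search?scroll=1m' -d '{
  "size": 10,
  "query": { "match_all": {} }
}'
# take "_scroll_id" from the response, then fetch the next 10 hits with:
curl -XGET 'http://localhost:9200/_search/scroll' -d '{
  "scroll": "1m",
  "scroll_id": "<the _scroll_id from the previous response>"
}'

Note that a scroll context only lives for the keep-alive you pass, so it suits sequential paging; whether that matches the "read more" use case is a separate question.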
As I said, that's totally understandable: I expect that you now have way more GC events; they're just much shorter.
First, this flag seems pretty esoteric, and second, it is on by default in Java 8, so setting it explicitly is ineffective (i.e. it did not change anything). I can only reiterate what I suggested previously:
1. Enable the GC log by (at least) specifying: -Xloggc:elasticsearch-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps (see the sketch after this list).
2. Restart the node(s).
3. Grab the GC log after some time and look at it, for example with GCViewer.
4. Draw conclusions based on the data you see and change the minimal set of GC JVM options.
5. Iterate until you're satisfied with the GC behavior.
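As a concrete example of step 1, the flags can be appended via ES_JAVA_OPTS when starting the node (the log path here is just an assumption; any location writable by the Elasticsearch user works):

ES_JAVA_OPTS="-Xloggc:/var/log/elasticsearch/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps" bin/elasticsearch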
This process will take some time and it is specific to your usage pattern of Elasticsearch.
I can give you just one specific recommendation: stick with the CMS garbage collector and stay away from G1GC; it is known to cause data corruption with Elasticsearch.