Dear Folks,
right now we see some serious performance issues on our AWS EC2 elasticsearch cluster. Our nodes are not responding very wenn and hanging at a CPU load of in average 90%.
Here is a copy of the hot_threads output:
::: [ip-XXX-XXX-XXX-XXX][e9oLvVBiRUmK6oW2aIVrqQ][ip-XXX-XXX-XXX-XXX][inet[ip-XXX-XXX-XXX-XXX.eu-west-1.compute.internal/XXX-XXX-XXX-XXX:9300]]
Hot threads at 2015-06-16T11:00:47.638Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
32.5% (162.3ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#1]'
2/10 snapshots sharing following 18 elements
org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.collect(FilterAggregator.java:61)
org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.collect(AggregationPhase.java:161)
org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:60)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:193)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
7/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
unique snapshot
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116)
org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:101)
org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:354)
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
org.elasticsearch.search.fetch.FetchPhase.loadStoredFields(FetchPhase.java:407)
org.elasticsearch.search.fetch.FetchPhase.createSearchHit(FetchPhase.java:217)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:182)
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:501)
org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:452)
org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:449)
org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
::: [ip-XXX-XXX-XXX-XXX][ZZfbwauFSZq8XXDnyVug4g][ip-XXX-XXX-XXX-XXX][inet[/XXX-XXX-XXX-XXX:9300]]
Hot threads at 2015-06-16T11:00:47.636Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
33.7% (168.6ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#5]'
2/10 snapshots sharing following 18 elements
org.apache.lucene.search.FilteredDocIdSetIterator.nextDoc(FilteredDocIdSetIterator.java:60)
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.nextDoc(ConstantScoreQuery.java:257)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 18 elements
org.elasticsearch.common.lucene.search.XBooleanFilter.getDocIdSet(XBooleanFilter.java:248)
org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.setNextReader(FilterAggregator.java:52)
org.elasticsearch.search.aggregations.support.AggregationContext.setNextReader(AggregationContext.java:94)
org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.setNextReader(AggregationPhase.java:167)
org.elasticsearch.common.lucene.MultiCollector.setNextReader(MultiCollector.java:68)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 12 elements
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
33.2% (165.7ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#1]'
3/10 snapshots sharing following 20 elements
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.nextDoc(ConstantScoreQuery.java:257)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:581)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:533)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:510)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:345)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
7/10 snapshots sharing following 9 elements
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
33.0% (165.1ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#2]'
4/10 snapshots sharing following 20 elements
org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucket(BucketsAggregator.java:55)
org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.collect(FilterAggregator.java:61)
org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.collect(AggregationPhase.java:161)
org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:60)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:193)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
3/10 snapshots sharing following 18 elements
org.apache.lucene.search.FilteredDocIdSetIterator.nextDoc(FilteredDocIdSetIterator.java:60)
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.nextDoc(ConstantScoreQuery.java:257)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
3/10 snapshots sharing following 12 elements
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Our cluster is based on two AWS EC2 r3.large instances with 15GB memory and SSDs for the storage. We activated mlockall (we checked also that it is active) and set our heap space to 10 GB (we also tried exactly 50% with 7 GB, no change). The cluster have 4 indexes with each 1 replica and 5 shards for later scaling requirements. We monitor and log all of our queries and respond times and there isn't a really spiky one which causes problems. Another question is why the cluster don't use all of the memory or at least the defined heap size of 10GB... 80% of our queries run against the whole dataset and there is no option to change it.
Here are some general stats:
Nodes: 2
Indices: 5
Shards: 32
Data: 28.69 GB
CPU: 362%
Memory: 3.62 GB (used) / 19.97 GB (available)
Up time: 16.4 h
Version: 1.5.2
Any suggestions? We don't really find the problem...
Thanks a lot!
Steph