Massive performance issues on our production cluster

Dear Folks,

Right now we are seeing some serious performance issues on our AWS EC2 Elasticsearch cluster. Our nodes are not responding very well and are hanging at an average CPU load of around 90%.

Here is a copy of the hot_threads output:


::: [ip-XXX-XXX-XXX-XXX][e9oLvVBiRUmK6oW2aIVrqQ][ip-XXX-XXX-XXX-XXX][inet[ip-XXX-XXX-XXX-XXX.eu-west-1.compute.internal/XXX-XXX-XXX-XXX:9300]]
Hot threads at 2015-06-16T11:00:47.638Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

32.5% (162.3ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#1]'
2/10 snapshots sharing following 18 elements
org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.collect(FilterAggregator.java:61)
org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.collect(AggregationPhase.java:161)
org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:60)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:193)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
7/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
unique snapshot
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116)
org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:101)
org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:354)
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
org.elasticsearch.search.fetch.FetchPhase.loadStoredFields(FetchPhase.java:407)
org.elasticsearch.search.fetch.FetchPhase.createSearchHit(FetchPhase.java:217)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:182)
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:501)
org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:452)
org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:449)
org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

::: [ip-XXX-XXX-XXX-XXX][ZZfbwauFSZq8XXDnyVug4g][ip-XXX-XXX-XXX-XXX][inet[/XXX-XXX-XXX-XXX:9300]]
Hot threads at 2015-06-16T11:00:47.636Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

33.7% (168.6ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#5]'
2/10 snapshots sharing following 18 elements
org.apache.lucene.search.FilteredDocIdSetIterator.nextDoc(FilteredDocIdSetIterator.java:60)
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.nextDoc(ConstantScoreQuery.java:257)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 18 elements
org.elasticsearch.common.lucene.search.XBooleanFilter.getDocIdSet(XBooleanFilter.java:248)
org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.setNextReader(FilterAggregator.java:52)
org.elasticsearch.search.aggregations.support.AggregationContext.setNextReader(AggregationContext.java:94)
org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.setNextReader(AggregationPhase.java:167)
org.elasticsearch.common.lucene.MultiCollector.setNextReader(MultiCollector.java:68)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
4/10 snapshots sharing following 12 elements
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

33.2% (165.7ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#1]'
3/10 snapshots sharing following 20 elements
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.nextDoc(ConstantScoreQuery.java:257)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:581)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:533)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:510)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:345)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
7/10 snapshots sharing following 9 elements
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

33.0% (165.1ms out of 500ms) cpu usage by thread 'elasticsearch[ip-XXX-XXX-XXX-XXX][search][T#2]'
4/10 snapshots sharing following 20 elements
org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucket(BucketsAggregator.java:55)
org.elasticsearch.search.aggregations.bucket.filter.FilterAggregator.collect(FilterAggregator.java:61)
org.elasticsearch.search.aggregations.AggregationPhase$AggregationsCollector.collect(AggregationPhase.java:161)
org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:60)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:193)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
3/10 snapshots sharing following 18 elements
org.apache.lucene.search.FilteredDocIdSetIterator.nextDoc(FilteredDocIdSetIterator.java:60)
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.nextDoc(ConstantScoreQuery.java:257)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
3/10 snapshots sharing following 12 elements
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:286)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:297)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)


Our cluster is based on two AWS EC2 r3.large instances with 15 GB of memory and SSDs for storage. We activated mlockall (and verified that it is active) and set our heap to 10 GB (we also tried exactly 50% with 7 GB, no change). The cluster has 4 indexes, each with 1 replica and 5 shards for later scaling requirements. We monitor and log all of our queries and response times, and there isn't a particularly spiky one that causes problems. Another question is why the cluster doesn't use all of the memory, or at least the defined heap size of 10 GB... 80% of our queries run against the whole dataset and there is no option to change that.
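For completeness, the relevant memory settings look roughly like this (a sketch; the exact file locations depend on how Elasticsearch was installed):

```
# elasticsearch.yml
bootstrap.mlockall: true

# environment, e.g. /etc/default/elasticsearch or the init script
ES_HEAP_SIZE=10g
```

You can verify mlockall is really active via `GET /_nodes?filter_path=**.mlockall` (or just `GET /_nodes/process` on 1.x) and checking that `mlockall` is `true` on every node.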

Here are some general stats:

Nodes: 2
Indices: 5
Shards: 32
Data: 28.69 GB
CPU: 362%
Memory: 3.62 GB (used) / 19.97 GB (available)
Up time: 16.4 h
Version: 1.5.2

Any suggestions? We don't really find the problem...

Thanks a lot!
Steph

Have you checked your ES logs to see if it is GC?
What sort of queries do you run?

No problems with GC. Mostly aggregation queries...

Aggregation queries can be very expensive; like sorting, they use field data. Depending on how you store your field data, this can cause problems. By default, ES keeps field data inside Lucene's inverted indices, which means that every time you execute a new aggregation, Lucene has to un-invert those indices in order to pull out the field data, load it into memory, and then perform the aggregation.
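One quick way to see how much field data your nodes are actually holding in memory (works on 1.x; index and field names in the output will be your own):

```
# per-node, per-field field data memory usage
curl 'localhost:9200/_nodes/stats/indices/fielddata?fields=*&pretty'

# or the more compact cat API
curl 'localhost:9200/_cat/fielddata?v'
```

If the fields you aggregate on show up with large memory footprints here, that's a strong hint the options below will help.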

There are a couple options to speed this up:

  1. Use Doc Values -> https://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html
  2. Eagerly load field data -> https://www.elastic.co/guide/en/elasticsearch/guide/current/preload-fielddata.html
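Both are mapping settings in ES 1.x. A sketch for a hypothetical index and fields (names are placeholders, not from your cluster):

```
PUT /my_index/_mapping/my_type
{
  "properties": {
    "status": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    },
    "category": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": { "loading": "eager" }
    }
  }
}
```

Note that `doc_values` only works on `not_analyzed` string fields and numeric/date fields, and it cannot be switched on for an existing field: you have to reindex (or add it for new fields going forward). Eager loading doesn't reduce memory use, it just moves the loading cost from the first query to refresh time.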

These are great tips, thank you very much! I wasn't aware of the future perspective of doc values. I'll dive into it a bit and come back with the results.

Best
Steph