[We are on Elasticsearch 1.7.2, 30-node cluster]
Occasionally we see CPU usage max out across the whole cluster. Each time, the cluster recovers on its own after about 20 minutes, but during that period it is unresponsive.
Output from hot threads shows that most nodes are running (or stuck) in the same place in the Lucene code. Here is a snippet:
100.6% (503.1ms out of 500ms) cpu usage by thread 'elasticsearch[node1][search][T#3]'
  10/10 snapshots sharing following 20 elements
    org.elasticsearch.common.lucene.docset.AndDocIdSet$AndBits.get(AndDocIdSet.java:116)
    org.elasticsearch.common.lucene.docset.BitsDocIdSetIterator.matchDoc(BitsDocIdSetIterator.java:45)
    org.elasticsearch.common.lucene.docset.MatchDocIdSetIterator.nextDoc(MatchDocIdSetIterator.java:50)
    org.apache.lucene.search.FilteredDocIdSetIterator.nextDoc(FilteredDocIdSetIterator.java:59)
    org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.nextDoc(ConstantScoreQuery.java:257)
    org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
    org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
    org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
    org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
    org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:370)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:795)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:786)
    org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
I've had a quick look at the Lucene code, and the line in question looks pretty inconspicuous. My guess is that Elasticsearch isn't stuck on that line as such, but is executing it over and over in a tight loop.
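To illustrate what "running that line over and over" could look like, here is a minimal Java sketch of the pattern the top three frames of the trace suggest: an iterator that advances one document at a time and asks a combined (AND) bit check whether each document matches. This is not the actual Elasticsearch or Lucene source. The class `LinearScanSketch`, its `AndBits` stand-in, and the `scan` helper are all hypothetical names made up for this example, and the filters are arbitrary placeholders.

```java
import java.util.function.IntPredicate;

public class LinearScanSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for an AND over several per-document checks. */
    static final class AndBits {
        private final IntPredicate[] filters;
        long calls = 0; // how many times get() was invoked

        AndBits(IntPredicate... filters) {
            this.filters = filters;
        }

        boolean get(int doc) {
            calls++;
            for (IntPredicate f : filters) {
                if (!f.test(doc)) {
                    return false; // first failing filter rejects the doc
                }
            }
            return true;
        }
    }

    /** Advance one document at a time until a match or the end of the segment. */
    static int nextDoc(AndBits bits, int doc, int maxDoc) {
        do {
            doc++;
            if (doc >= maxDoc) {
                return NO_MORE_DOCS;
            }
        } while (!bits.get(doc));
        return doc;
    }

    /** Scans docs 0..maxDoc-1 and returns {number of matches, number of get() calls}. */
    static long[] scan(int maxDoc, IntPredicate... filters) {
        AndBits bits = new AndBits(filters);
        long matches = 0;
        int doc = -1;
        while ((doc = nextDoc(bits, doc, maxDoc)) != NO_MORE_DOCS) {
            matches++;
        }
        return new long[] {matches, bits.calls};
    }

    public static void main(String[] args) {
        // Two placeholder filters; only docs passing both are matches.
        long[] r = scan(1_000_000, d -> d % 1000 == 0, d -> d % 7 == 0);
        System.out.println("matches=" + r[0] + ", get() calls=" + r[1]);
    }
}
```

In this sketch `get()` is called once for every document in the range no matter how few documents match, so a cheap-looking line can still dominate CPU time when it sits inside a per-document loop over a large index. Whether that is what is actually happening on the cluster is an assumption that would need to be confirmed against the queries running at the time.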
Has anyone seen anything similar, or have any thoughts as to what might be going on?