ES with 100% CPU

Hi,

We're using ES 1.7.5.

Stacks look like this:

   sun.nio.ch.NativeThread.current(Native Method)
   sun.nio.ch.NativeThreadSet.add(NativeThreadSet.java:46)
   sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:737)
   sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
   org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
   org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
   org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:140)
   org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116)
   org.apache.lucene.codecs.lucene41.ForUtil.readBlock(ForUtil.java:207)
   org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs(Lucene41PostingsReader.java:404)
   org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc(Lucene41PostingsReader.java:446)
   org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:111)
   org.elasticsearch.common.lucene.search.XBooleanFilter.getDocIdSet(XBooleanFilter.java:83)
   org.elasticsearch.common.lucene.search.AndFilter.getDocIdSet(AndFilter.java:54)
   org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(ApplyAcceptedDocsFilter.java:46)
   org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:157)
   org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:542)
   org.apache.lucene.search.FilteredQuery$FilterStrategy.filteredBulkScorer(FilteredQuery.java:504)
   org.apache.lucene.search.FilteredQuery$1.bulkScorer(FilteredQuery.java:150)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:618)
   org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
   org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
   org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
   org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
   org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
   org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

Our queries are not very complex (they use a few filters), but they also sort on a field that is mapped with doc values (to avoid fielddata memory).
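
To give a rough idea, a query looks something like this (the index, field names, and filter values here are placeholders, not our real ones):

   curl -XPOST 'localhost:9200/our_index/_search?pretty' -d '{
     "size": 20,
     "query": {
       "filtered": {
         "query": { "match_all": {} },
         "filter": {
           "and": [
             { "term":  { "status": "active" } },
             { "range": { "timestamp": { "gte": "now-1d" } } }
           ]
         }
       }
     },
     "sort": [ { "timestamp": { "order": "desc" } } ]
   }'

The sort field is mapped with "doc_values": true.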

Any way we can improve this?

Thanks,
Roy.

Hello,
What's your OS, and what is your index.store.type setting?
Do you have spare RAM on that machine?
How often do the 'few filters' change? i.e., is there an opportunity for cache hits?

Most of the filters are cached; 1 or 2 of them are not, since they constantly change.
OS is Ubuntu 14.04
index.store.type is the default, we haven't set anything specific.

Regarding RAM - The machine has 120GB RAM, 30GB of which is used for heap, the rest is used for filesystem cache.

Given that stack trace, switching to mmapfs might help reduce CPU load.

Is it something that can be changed with a restart, or does it require recreating and reindexing the data?

I think I got my answer from here:
http://elasticsearch-users.115913.n3.nabble.com/How-to-run-elastic-from-memory-td4031249.html

It's possible to just restart the cluster for the new setting to take effect.
However, something is still a bit unclear: is it OK to run with a mixed configuration, e.g. one server running with mmapfs while the other servers run with the default, so we can test the configuration before rolling it out to all of the nodes?

Thanks for the help.
Roy.

You don't need to reindex; a restart is enough.
The setting is independent across nodes, so you can experiment with only one of them.
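
For example, on the test node (a sketch; the config path and service name are assumptions for a standard Ubuntu package install):

   # set the store type on this node only, then restart it
   echo "index.store.type: mmapfs" | sudo tee -a /etc/elasticsearch/elasticsearch.yml
   sudo service elasticsearch restart
   # check which store type each node reports via the nodes info API
   curl -s 'localhost:9200/_nodes/settings?pretty'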

The setting change made things much worse in terms of IO:

Flush, refresh, and search-fetch times all grew by 5x, some even by 10x.

Roy.

Any other ideas on how to improve sort performance? We're pretty sure that's the root cause.

The puzzle is that the stack trace doesn't show anything about sorting. Could it be that the query contains a kilometre-long "multiterm" clause? Do you update docs heavily? Is the 30G heap really necessary? Does the filesystem cache actually hold the index files? Another strange thing is that switching from niofs to mmapfs affected reading, but flush is about writing.

What I see here is that the scorer goes over a lot of docs, and I guess that is because of the sorting; I'm not sure I got that right.

For example, the scorer goes through all matching docs even without sorting, at least to count the number of hits. But can you confirm that the query with sorting takes much more time than the same query without sorting (with caching disabled)? By the way, what's in that doc values field? Isn't it really long?
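
Something like this, with "_cache": false so the cache doesn't skew the comparison (names are placeholders):

   # same filter twice, once without sort and once with; compare the "took" values
   curl -s 'localhost:9200/our_index/_search?pretty' -d '{
     "size": 20,
     "query": { "filtered": { "filter": { "term": { "status": "active", "_cache": false } } } }
   }'
   curl -s 'localhost:9200/our_index/_search?pretty' -d '{
     "size": 20,
     "query": { "filtered": { "filter": { "term": { "status": "active", "_cache": false } } } },
     "sort": [ { "timestamp": { "order": "desc" } } ]
   }'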

From what I understand, when sorting, the scorer goes over all docs that match the query (after applying the filter, but before sorting and keeping only the top hits), so it has to visit every matching doc even if you only take a small number of hits.

The value is a timestamp field.

Another example of a stacktrace:

   75.8% (379ms out of 500ms) cpu usage by thread 'elasticsearch[AAA][search][T#22]'
     10/10 snapshots sharing following 22 elements
       org.elasticsearch.common.lucene.search.XBooleanFilter.getDocIdSet(XBooleanFilter.java:248)
       org.elasticsearch.common.lucene.search.XBooleanFilter.getDocIdSet(XBooleanFilter.java:83)
       org.elasticsearch.common.lucene.search.XBooleanFilter.getDocIdSet(XBooleanFilter.java:83)
       org.elasticsearch.common.lucene.search.AndFilter.getDocIdSet(AndFilter.java:54)
       org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(ApplyAcceptedDocsFilter.java:46)
       org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:157)
       org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:542)
       org.apache.lucene.search.FilteredQuery$FilterStrategy.filteredBulkScorer(FilteredQuery.java:504)
       org.apache.lucene.search.FilteredQuery$1.bulkScorer(FilteredQuery.java:150)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:618)
       org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:117)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
       org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
       org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
       org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)

It seems like one of the filter clauses yields many hits and is fully evaluated too often. Can you exclude it from the eagerly cached filters?

Is there any way to determine what filter that is?

It looks like there is no Profile API in 1.7. You can manually check the number of hits under the frequent clauses, or experiment with "_cache": false.
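
For example (a sketch; the clauses here stand in for your actual frequent filters):

   # 1) count how many docs each frequent clause matches on its own
   curl -s 'localhost:9200/our_index/_count?pretty' -d '{
     "query": { "constant_score": { "filter": { "term": { "status": "active" } } } }
   }'
   # 2) re-run the real query with caching disabled on a suspect clause
   curl -s 'localhost:9200/our_index/_search?pretty' -d '{
     "query": {
       "filtered": {
         "filter": {
           "and": [
             { "term":  { "status": "active", "_cache": false } },
             { "range": { "timestamp": { "gte": "now-1d" } } }
           ]
         }
       }
     }
   }'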

Got it, thanks.
Out of curiosity: if there's a filter that is executed constantly but matches a large range of documents, is it better not to cache it, even though it runs multiple times?

Sometimes that makes a lot of sense: building the cached bitset means fully evaluating the filter over every document in a segment (and again for new segments as the index changes), so for a very broad filter the cache can cost more than it saves.