Query performance issue - need help to investigate

Hi Folks,

I could use some help investigating a general performance issue on my ES 2.1.0 cluster.

First, my problem: simple queries such as match_all (default size 10, no ordering) consistently take around ~2250ms, even on subsequent executions.
This is on a single index with ~240,000,000 documents of pretty small size (overall the index is ~50GB).
This seems to be a general issue which of course slows down all other kinds of queries and probably aggregations as well.
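For reference, the kind of request I am timing is essentially just this (index name is only a placeholder):

    curl -s 'localhost:9200/logs-2016-w01/_search' -d '{
      "query": { "match_all": {} }
    }'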

Here are some details about my setup:

  • Running Elasticsearch 2.1.0 on CentOS 6.7
  • 4 Node Cluster (bare metal)
  • 16GB heap per instance, with plenty of RAM left for disk IO buffering
  • Time-based indices (weekly) with each having ~240,000,000 documents and 50GB per index
  • Each index has 2 shards with one replica (no routing)
  • Configuration is pretty much default apart from bootstrap.mlockall: true and some recovery options set (snippet below)
  • One storage path per instance (RAID10 spinning disks with pretty good read performance, ~700-800 MB/s for sequential reads)
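
The relevant bit of elasticsearch.yml is basically just this (recovery settings left out here):

    bootstrap.mlockall: true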

While a match_all query runs there is no noticeable disk activity (probably because the data is either cached in ES or in the OS cache for subsequent reads).
I also cannot detect any spike in CPU usage.
Heap utilization is quite stable at ~50-60% for all instances.

I have the HQ plugin installed, and the only numbers that do not look healthy there are the following (from Index Activity):

  • Search - Query: 67.41ms [query_time_in_millis / query_total ]
  • Search - Fetch: 29.75ms [fetch_time_in_millis / fetch_total]
  • Get - Total: 10.32ms [get.time_in_millis / get.total]
  • Get - Exists: 11.88ms [get.exists_time_in_millis / get.exists_total]
  • Refresh: 3001.68ms [refresh.total_time_in_millis / refresh.total]
  • Flush: 4371.25ms [flush.total_time_in_millis / flush.total]

To add: I only index once a week and am okay with the indexing performance (I get ~5000 docs per second with a single thread).
Going by those numbers this looks like a disk IO problem, but all my disk IO tests look very good (I just tested with hdparm -tT and by reading a chunk of data with dd).
Also, as I mentioned, this affects queries that are run multiple times and should be served from the cache.
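For completeness, the disk checks were roughly the following (device and file paths are placeholders):

    hdparm -tT /dev/sda
    dd if=/path/to/some/large/file of=/dev/null bs=1M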

Can anyone help me to further investigate this?
Are there any other useful ES metrics that I could look at?

I also ran a hot_threads request while the simple match_all query was running (for its ~2.5 seconds); the call (roughly) and the two outstanding threads are below:
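
    curl -s 'localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'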

:::
Hot threads at 2016-01-05T17:28:12.147Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
100.8% (503.7ms out of 500ms) cpu usage by thread 'elasticsearch[esnode-003][search][T#34]'
  7/10 snapshots sharing following 16 elements
    org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:226)
    org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:169)
    org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:772)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:324)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:106)
    org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:363)
    org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:375)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
    org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
    org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
  3/10 snapshots sharing following 15 elements
    org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:169)
    org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:772)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:324)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:106)
    org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:363)
    org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:375)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
    org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
    org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
::: 
Hot threads at 2016-01-05T17:28:12.148Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
 
101.0% (504.8ms out of 500ms) cpu usage by thread 'elasticsearch[esnode-004][search][T#21]'
  9/10 snapshots sharing following 17 elements
    org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:64)
    org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:226)
    org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:169)
    org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:772)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:324)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:106)
    org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:363)
    org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:375)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
    org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
    org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
  unique snapshot
    org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:225)
    org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:169)
    org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:772)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:324)
    org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:106)
    org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:363)
    org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:375)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
    org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
    org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
    org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)

One more piece of information: restarting the cluster brings the simple match_all query down to ~1000ms, which still seems very high. It then seems to increase over time, after we run some more complicated queries and aggregations.

The hot threads indicate that the execution of your match_all query is cpu-bound. Elasticsearch 2.2 will bring some performance improvements to the match_all query (see annotation R at https://benchmarks.elastic.co/index.html#search_times).

Otherwise there isn't really anything you can do but buy faster hardware: it just takes time to visit all 240M matching documents.

Have you tried using a larger number of shards for each index?

Thanks for your responses!

@jpountz okay, I get that, but why would a match_all query ever 'visit' all documents?
Isn't it just: please give me any 10 documents plus the overall count (which I thought would be available without visiting all documents)?
The threads seem to spend most of their time in org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:226).
Without knowing the internals of the Lucene API: why is match_all scoring at all? Won't ES just return a static score anyway?
Good to know about the upcoming improvement. Strictly speaking I don't have any business case for match_all.
I was using it to gauge whether there is a general issue with my cluster, because I also see some other queries being very slow (e.g. an ids filter with 10 IDs, sketched below, taking 1.5-3 seconds).
I will try to get a hot_threads output for my other slow queries to see if this is the same issue.
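For reference, the slow ids filter is essentially of this form (IDs shortened and made up here, 10 of them in reality):

    curl -s 'localhost:9200/logs-2016-w01/_search' -d '{
      "query": {
        "bool": {
          "filter": {
            "ids": { "values": ["doc-id-1", "doc-id-2", "doc-id-3"] }
          }
        }
      }
    }'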

@Christian_Dahlqvist not yet. Should I? Are you asking because you think it may help to parallelize the work and thereby speed up the process?
Including the replicas, I already have one shard per ES instance.

Having more shards should increase parallelism and ensure all nodes participate in the query. Query performance also tends to depend on shard size, so having a larger number of smaller shards may help. I would therefore recommend reindexing into an index with e.g. 8 shards and seeing if/how that affects performance.

Another benefit of having more shards is that it allows Elasticsearch to more evenly redistribute data on node failure.
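For example, the new index could be created with a settings block like this before reindexing (name and replica count are just examples):

    curl -s -XPUT 'localhost:9200/logs-2016-w01-8shards' -d '{
      "settings": {
        "number_of_shards": 8,
        "number_of_replicas": 1
      }
    }'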

Indeed, match_all could be optimized this way, as the number of live documents is cached by the index. Elasticsearch will already do this if you set size=0 (see Optimize counts on simple queries. by jpountz · Pull Request #13037 · elastic/elasticsearch · GitHub). I would like to generalize this to more cases, but it needs some changes if we want to keep the optimization manageable.
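In other words, a request along these lines should already be served from the cached live-documents count rather than by visiting every document (index name is a placeholder):

    curl -s 'localhost:9200/logs-2016-w01/_search' -d '{
      "size": 0,
      "query": { "match_all": {} }
    }'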

This method actually means "collect all matching documents and give them a score". In the match_all case however, giving a score is just a matter of returning a constant.

That is weird indeed. The hot threads will hopefully help.

Yes, because Elasticsearch searches a given shard in a single thread (having replicas or not does not matter). However, note that having more shards can also cause problems, like less overall compression and more work for Elasticsearch to manage all the shards, so you would need to find the sweet spot.