Performance impact of returning large result sets

Hello.

I would like to use Elasticsearch to retrieve the top 1000-10000 documents for a query and then apply some post-processing to those documents.

The problem is that performance seems to drop a lot when increasing the size of the result set from 10 to 1000 and 10000.

To measure performance I am using simple queries like this:

GET /index/type/_search
{
   "fields": [],
   "query": {
      "filtered": {
         "filter": {
            "bool": {
               "must": [
                  { "term": { "title": "hello" } },
                  { "term": { "title": "elasticsearch" } }
               ]
            }
         }
      }
   },
   "size": 10
}

I am only interested in the document IDs.

After running 1000 queries through the Java API, I got the following average response times:

size = 10 -> 22 milliseconds
size = 100 -> 179 milliseconds
size = 1000 -> 288 milliseconds
size = 10000 -> 483 milliseconds

I do not understand why there is so much difference. Since I am only asking for the document IDs, there should not be much overhead from fetching documents from disk.

Using scroll does not seem to provide much better results.

SOME ADDITIONAL INFORMATION

  • I am running ES and the tests on the same computer.
  • I have 16 GB of memory.
  • I am running ES as: ./elasticsearch -Xmx8g -Xms8g
  • Documents in index: 19 million
  • Size of the index: 3 GB
  • Shards configuration: default

Thanks a lot for the help!

You might want to learn about deep paging. Under the hood, Elasticsearch needs to build a ranked set of results on each shard. So asking for 10K results means, for a normal search, that each shard returns its own top 10K results to the node handling the search. This node must then merge the 10K results from every shard, throwing out most of them, to build the result set it returns to you.
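To make the cost of deep paging concrete, here is a minimal Python simulation of the merge step described above. It is only a sketch and not Elasticsearch code: the shard count of 5 is an assumption (the thread only says "default"), the scores are random numbers, and the names shard_top_k and coordinate are made up for illustration.

```python
import heapq
import random


def shard_top_k(hits, k):
    """Each shard ranks its own documents and returns its k best (score, doc_id) pairs."""
    return heapq.nlargest(k, hits)


def coordinate(shards, k):
    """Simulate the coordinating node: collect k hits per shard, keep the global top k."""
    per_shard = [shard_top_k(hits, k) for hits in shards]
    received = sum(len(result) for result in per_shard)
    merged = heapq.nlargest(k, (hit for result in per_shard for hit in result))
    return merged, received


random.seed(0)
NUM_SHARDS = 5          # assumption: number of primary shards
DOCS_PER_SHARD = 20000  # assumption: toy data, far smaller than the real index

shards = [
    [(random.random(), f"shard{n}-doc{i}") for i in range(DOCS_PER_SHARD)]
    for n in range(NUM_SHARDS)
]

for size in (10, 100, 1000, 10000):
    merged, received = coordinate(shards, size)
    print(f"size={size}: coordinating node received {received} hits, "
          f"discarded {received - len(merged)}")
```

Under these assumptions the number of hits shipped to the coordinating node grows as size times the number of shards, which is one reason larger result sets cost more even when only IDs are requested.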

That all being said, you're using filters, and I'm not sure whether that changes the picture. But you can take advantage of another feature to pull back large result sets: the scan and scroll API. This is probably a better fit for you. I'd experiment with it to see if it is more appropriate for your problem.
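For reference, a scan and scroll loop could look roughly like the Python sketch below. It assumes an Elasticsearch 1.x node, as the "filtered" query syntax in the question suggests; search_type=scan was removed in later versions, where a plain scroll is used instead. The base URL and the helper names (build_scan_request, scan_ids) are hypothetical, and the index and type names are copied from the question. Note that with scan, size applies per shard, so each page can hold up to size times the number of shards hits.

```python
import json
import urllib.request

# The filter from the question, unchanged.
QUERY = {
    "filtered": {
        "filter": {
            "bool": {
                "must": [
                    {"term": {"title": "hello"}},
                    {"term": {"title": "elasticsearch"}},
                ]
            }
        }
    }
}


def build_scan_request(base_url, index, doc_type, query, size=1000, scroll="1m"):
    """Build the URL and body that open a scan (unsorted) scroll context."""
    url = f"{base_url}/{index}/{doc_type}/_search?search_type=scan&scroll={scroll}"
    body = {"fields": [], "query": query, "size": size}  # empty "fields": IDs only
    return url, json.dumps(body).encode("utf-8")


def _post(url, data):
    """POST raw bytes and decode the JSON response (1.x needs no Content-Type header)."""
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def scan_ids(base_url, index, doc_type, query, size=1000, scroll="1m"):
    """Yield document IDs page by page until the scroll returns no more hits."""
    url, data = build_scan_request(base_url, index, doc_type, query, size, scroll)
    scroll_id = _post(url, data)["_scroll_id"]  # the scan request itself returns no hits
    while True:
        page = _post(f"{base_url}/_search/scroll?scroll={scroll}",
                     scroll_id.encode("utf-8"))
        hits = page["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit["_id"]
        scroll_id = page["_scroll_id"]  # always continue with the latest scroll ID


# Example usage (requires a running node, so it is left commented out):
# for doc_id in scan_ids("http://localhost:9200", "index", "type", QUERY):
#     print(doc_id)
```

Because scan skips scoring and sorting, it avoids the per-shard ranking work, but every page still costs a round trip, so the page size is worth tuning when measuring.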

Hi softwaredoug, and thank you very much for your answer.

I understand the problem with deep paging, but as you said, I am just using filters and there is no sorting required. I have also tried using scan and scroll. It is a little faster but still too slow for what I am trying to achieve.