High performance penalty when size in query is increased

Hi

I have a query of the form:

{ 
 "size" : 10,
  "query" : {
    <Calling a plugin with different params, which filters and scores>
  },
  "_source" : {
    "includes" : [ "title" ]
  } 
}

When I increase the size, the query time increases considerably.
The total number of hits is 485.
The times I see (size: query time) are:

10: ~60ms
20: ~75ms
30: ~105ms
40: ~140ms
50: ~150ms
100: ~220ms
200: ~460ms
300: ~680ms
400: ~850ms
500: ~2300ms
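
For reference, the marginal cost per extra returned hit can be worked out from the timings above. This is only a minimal sketch: it assumes that a request with size 500 returns all 485 hits, and the names used are purely illustrative.

```python
# Timings reported above: requested size -> approximate query time in ms.
timings_ms = {10: 60, 20: 75, 30: 105, 40: 140, 50: 150,
              100: 220, 200: 460, 300: 680, 400: 850, 500: 2300}
TOTAL_HITS = 485  # a request cannot return more hits than this


def marginal_cost_per_hit(timings, total_hits):
    """Return (from_size, to_size, extra ms per extra returned hit) for each step."""
    sizes = sorted(timings)
    steps = []
    for prev, cur in zip(sizes, sizes[1:]):
        # Cap the number of returned hits at the total number of matches.
        extra_hits = min(cur, total_hits) - min(prev, total_hits)
        extra_ms = timings[cur] - timings[prev]
        steps.append((prev, cur, extra_ms / extra_hits))
    return steps


for prev, cur, ms_per_hit in marginal_cost_per_hit(timings_ms, TOTAL_HITS):
    print(f"{prev:>3} -> {cur:>3}: {ms_per_hit:.1f} ms per extra hit")
```

Up to size 400 each extra hit costs roughly 1 to 3.5 ms, which points at per-hit work rather than scoring; only the last step (400 to 500) is far above that, at about 17 ms per extra hit.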

What I do not understand is why it gets so much slower.
I would expect the filtering and the scoring to take the same amount of time no matter how many results are extracted, since every matching document has to be scored to know which ones to return?

And I only extract the title, which is a small string, so it should not take up much space?

Any help understanding Elasticsearch and this problem is much appreciated.


Info about the setup:

4 nodes in cluster

Shards: 4

~40GB -> 10GB per shard

Total number of documents:
~180000

Each document is quite large,
but we are only extracting the title (via the _source.includes param)

Elastic Version:
{
  "version" : {
    "number" : "1.7.6",
    "build_hash" : "c730b59357f8ebc555286794dcd90b3411f517c9",
    "build_timestamp" : "2016-11-18T15:21:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  }
}

Echoing Issues · elastic/elasticsearch · GitHub

Source filtering adds extra overhead compared with no filtering at all. As you said, in order to filter the fields correctly, the source of every search hit must be loaded, parsed and then filtered. On 1000 hits that represents a fair amount of work.
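
As a rough illustration of why this cost grows with both the number of hits and the document size, here is a toy model in Python. It is not how Elasticsearch implements source filtering; `filter_source` and the sample document are made up for the sketch.

```python
import json


def filter_source(raw_source, includes):
    """Toy model of source filtering for one hit: parse everything, keep a few fields."""
    doc = json.loads(raw_source)  # the whole document is parsed, however large it is
    return {key: value for key, value in doc.items() if key in includes}


# Hypothetical large document: a small title next to a big body.
raw = json.dumps({"title": "A short title", "body": "x" * 100_000})

# The per-hit work is repeated for every returned hit, so it grows with "size".
hits = [filter_source(raw, ["title"]) for _ in range(10)]
print(hits[0])
```

Even though only the small title comes out, the full document goes in for every hit, so large documents make each additional hit expensive.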

I'd try comparing your results to what happens when you turn source filtering off, if the reason you are filtering is not to reduce network traffic. Also, this advice from the issue mentioned might help:

... did you try to use the Response Filtering feature? I think it can filter the source of search hits now. It should be more efficient in your case because it only filters the "output", resulting in less data to transfer over the network.
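
A minimal sketch of what a request using response filtering could look like, built in Python without sending anything. The host, the index name `myindex` and the helper `build_search_url` are assumptions for illustration, and whether `filter_path` can reach inside `_source` on a given version should be checked against the documentation for that version.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index, filter_paths):
    """Build a _search URL that asks Elasticsearch to trim the response via filter_path."""
    # filter_path takes a comma-separated list of dotted paths to keep in the response.
    query_string = urlencode({"filter_path": ",".join(filter_paths)}, safe=",")
    return f"{base_url}/{index}/_search?{query_string}"


# Hypothetical host and index; keep only the total and the title of each hit.
url = build_search_url("http://localhost:9200", "myindex",
                       ["hits.total", "hits.hits._source.title"])
print(url)
```

The request body itself stays the same as before, except that the `_source` includes block would presumably be dropped so that the trimming happens only on the final response.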

Thank you Christoph!

I found the GitHub issue as well, and it described exactly my case.
If someone else ends up here: I found that using stored fields can be a solution to this problem.
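
For anyone wanting to try this, here is a minimal sketch of the two pieces involved on a 1.x cluster, written as Python dicts that serialize to the JSON bodies. The type name `doc` and the helper names are assumptions, `match_all` stands in for the plugin query that is not shown in this thread, and documents indexed before the mapping change would most likely need to be reindexed before the stored field is available.

```python
import json


def stored_title_mapping(type_name="doc"):
    """Mapping that stores 'title' separately from _source (1.x-style 'string' type)."""
    return {"mappings": {type_name: {"properties": {
        "title": {"type": "string", "store": True}}}}}


def stored_fields_search(query, size):
    """Search body that requests the stored 'title' field instead of filtering _source."""
    # In 1.x the request key is "fields"; later versions renamed it to "stored_fields".
    return {"size": size, "query": query, "fields": ["title"]}


# match_all is only a placeholder for the real (plugin) query.
body = stored_fields_search({"match_all": {}}, size=500)
print(json.dumps(stored_title_mapping(), indent=2))
print(json.dumps(body, indent=2))
```

The point of the design is that the large `_source` no longer has to be loaded and parsed per hit; only the small stored `title` value is read.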
