Hello.
I would like to use Elasticsearch to retrieve the top 1000-10000 documents for a query and then apply some post-processing to those documents.
The problem is that performance drops significantly when the size of the result set is increased from 10 to 1000 or 10000.
To measure performance I am using simple queries like this:
GET /index/type/_search
{
  "fields": [],
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "title": "hello" } },
            { "term": { "title": "elasticsearch" } }
          ]
        }
      }
    }
  },
  "size": 10
}
I am only interested in the document ids.
After running 1000 queries through the Java API, I got the following average response times:
size = 10 -> 22 milliseconds
size = 100 -> 179 milliseconds
size = 1000 -> 288 milliseconds
size = 10000 -> 483 milliseconds
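For reference, the timing loop looks roughly like this (a minimal sketch; the search call itself is replaced by a placeholder Supplier, and the helper names are my own, not part of the Elasticsearch API):

```java
import java.util.function.Supplier;

public class LatencyBenchmark {

    // Average latency in milliseconds over the given samples (nanoseconds).
    static double averageMillis(long[] samplesNanos) {
        long total = 0;
        for (long s : samplesNanos) {
            total += s;
        }
        return total / 1e6 / samplesNanos.length;
    }

    // Run the task `runs` times and record each latency in nanoseconds.
    static long[] measure(Supplier<?> task, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get(); // in the real test: the search request via the Java client
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    public static void main(String[] args) {
        // Placeholder standing in for the actual Elasticsearch query.
        Supplier<Integer> dummyQuery = () -> 1 + 1;
        long[] samples = measure(dummyQuery, 1000);
        System.out.printf("average = %.3f ms%n", averageMillis(samples));
    }
}
```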
I do not understand why there is such a large difference. Since I am only asking for the document ids, there should not be much overhead from fetching documents from disk.
Using scroll does not seem to provide much better results.
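For completeness, this is the kind of scan/scroll request I tried (a sketch with the same filtered query as above; the 1m scroll timeout is just the value I picked):

```
GET /index/type/_search?search_type=scan&scroll=1m
{
  "fields": [],
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "title": "hello" } },
            { "term": { "title": "elasticsearch" } }
          ]
        }
      }
    }
  },
  "size": 1000
}

GET /_search/scroll?scroll=1m&scroll_id=<scroll_id from the previous response>
```

Note that with search_type=scan the size applies per shard, so each scroll page can return up to size * number_of_shards documents.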
SOME ADDITIONAL INFORMATION
- I am running ES and the tests on the same machine.
- I have 16 GB of memory.
- I am running ES as: ./elasticsearch -Xmx8g -Xms8g
- Documents in the index: 19 million
- Index size: 3 GB
- Shard configuration: default
Thanks a lot for the help!