We're running Elasticsearch 1.7.5 and we have an API that, given a few filters (API-level, an abstraction over what's stored in ES), returns a list of matching doc ids. This list can reach 100k ids, which is why we're using scan/scroll.
What I've noticed is that the first time we perform the scan/scroll with a few filters, it takes ~25s to scroll over all the results (size=2k, 6 shards, query with 46k results). If we repeat the same query a second time it takes the same time, probably because we have replica=1 and the second request is hitting the other copy of each shard. The third time it takes 4s.
Why this difference? Ok, there's the filter cache, but that is also created with a normal search (not scan) that takes milliseconds. And I'm using _source=false, therefore it shouldn't be hitting the disk to retrieve data, because I only need ids.
Related question: is it ok to use scan/scroll for this? I only need ids, can I switch to a normal _search?
Yet another question: can I turn off things that bloat the response like the index name and the score that are in every hit?
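To make the setup concrete, here is a minimal sketch of the kind of scan/scroll loop described above. It is not the actual code behind the API: `collect_ids` is a hypothetical helper, and `client` is assumed to expose `search()` and `scroll()` the way the elasticsearch-py client does. In ES 1.x the initial scan request returns only a scroll id and no hits, and `size` is applied per shard, so size=2k on 6 shards yields up to 12k hits per batch.

```python
def collect_ids(client, index, query, size=2000, scroll="1m"):
    """Return every matching _id using a scan/scroll loop (ES 1.x style).

    `client` is assumed to behave like an elasticsearch-py client.
    """
    # Open the scan. In scan mode this first response carries no hits,
    # only the scroll id used to fetch the first batch.
    resp = client.search(
        index=index,
        search_type="scan",
        scroll=scroll,
        size=size,  # per shard in scan mode
        body={"query": query, "_source": False},
    )

    ids = []
    while True:
        # Always pass the most recent scroll id to the next request.
        resp = client.scroll(scroll_id=resp["_scroll_id"], scroll=scroll)
        hits = resp["hits"]["hits"]
        if not hits:
            break  # an empty batch means the scroll is exhausted
        ids.extend(hit["_id"] for hit in hits)
    return ids
```

With a real client this would be called as something like `collect_ids(es, "myindex", {"match_all": {}})`, where the index name and query are placeholders.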
Back to your response... FS caching of what exactly? Not of the indexes, because a "normal" query runs in milliseconds, and that needs to read the whole inverted index too, at least for the two fields I'm stressing.
Ok, but shouldn't a search take the same amount of time? I mean, a scan/scroll does 2 things:
1. it filters the results, to find all possible matches, like a normal search
2. it returns the ids of every result, unlike a normal search
If the scan takes two orders of magnitude longer than a normal search, and point 1 is the same, it means point 2 is extremely slow. Even though I use _source=false, does it mean it's reading the _source to get the _id?
Using fatrace I see a lot of reads on .fdt files (9k reads per shard, 6 shards, 13k results). In my understanding these files don't store "field data" à la Elasticsearch but the stored fields, i.e. the _source, I guess. This would confirm my guess...
What about _uid? Can I ask that instead of _id? Would it be better?
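For anyone who wants to reproduce the fatrace measurement, here is a rough sketch of how the reads can be tallied per file extension. It assumes fatrace prints lines shaped like `java(1234): R /path/to/_0.fdt` (process, pid, event types, path); the function name and the sample paths are made up for illustration.

```python
import os
import re
from collections import Counter

# Assumed fatrace line shape: "<process>(<pid>): <event types> <path>"
LINE_RE = re.compile(
    r"^(?P<proc>.+)\((?P<pid>\d+)\): (?P<types>[A-Z<>+]+) (?P<path>.+)$"
)

def count_reads_by_extension(lines):
    """Count read events per file extension from fatrace output lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Skip lines that do not parse and events that are not reads.
        if match is None or "R" not in match.group("types"):
            continue
        extension = os.path.splitext(match.group("path"))[1]
        counts[extension] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "java(1234): R /data/myindex/0/index/_0.fdt",
        "java(1234): O /data/myindex/0/index/_0.tim",
    ]
    print(count_reads_by_extension(sample))
```

If the 9k figure is per shard, that is about 54k reads on .fdt files for 13k results, or roughly 4 reads per returned hit.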