We're running Elasticsearch 1.7.5 and have an API that, given a few filters (API-level ones, an abstraction over what's stored in ES), returns a list of matching document IDs. This list can reach 100k IDs, which is why we're using scan/scroll.
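For illustration, the scan/scroll flow looks roughly like the sketch below. It is not our production code: `post` is a hypothetical helper that sends a POST request and returns the decoded JSON, and `scan_ids` is just an illustrative name. It assumes the 1.x REST behaviour, where the initial `search_type=scan` request returns only a scroll ID and no hits, the scroll ID is sent as the raw body of each follow-up request, and `size` applies per shard.

```python
import json


def scan_ids(post, index, query, size=2000, scroll="1m"):
    """Yield the _id of every document matching `query` via scan/scroll.

    `post(path, params, body)` is a hypothetical helper (an assumption, not
    a real client API) that returns the decoded JSON response as a dict.
    """
    # Open the scan. With search_type=scan the first response carries no
    # hits, only a scroll id; `size` is applied per shard.
    body = json.dumps({"query": query, "_source": False})
    resp = post(
        "/%s/_search" % index,
        {"search_type": "scan", "scroll": scroll, "size": size},
        body,
    )
    scroll_id = resp["_scroll_id"]

    while True:
        # In 1.x the scroll id is sent as the raw request body.
        resp = post("/_search/scroll", {"scroll": scroll}, scroll_id)
        hits = resp["hits"]["hits"]
        if not hits:
            break  # an empty page means the scroll is exhausted
        for hit in hits:
            yield hit["_id"]
        # Always continue with the most recent scroll id.
        scroll_id = resp["_scroll_id"]
```

Because `size` is per shard under scan, each batch can return up to `size * number_of_shards` hits, so the number of round trips is smaller than the total divided by `size`.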
What I've noticed is that the first time we run the scan/scroll with a given set of filters, it takes ~25s to scroll over all the results (size=2k, 6 shards, query with 46k results). If we repeat the same query a second time, it takes just as long, probably because we have replica=1 and the second run is hitting the other copy of each shard. The third time it takes 4s.
Why this difference? OK, there's the filter cache, but that is also populated by a normal search (not scan), which takes milliseconds. And I'm using _source=false, so it shouldn't need to hit the disk to retrieve data, since I only need IDs.
Related question: is it OK to use scan/scroll for this? I only need IDs, so could I switch to a normal _search instead?
Yet another question: can I turn off the things that bloat the response, like the index name and the score that appear in every hit?
Thank you