Scan/Scroll performance and cache

vad · March 11, 2016, 3:29pm

We're running Elasticsearch 1.7.5 and we have an API that, given a few filters (API-level, an abstraction over what's stored in ES), returns a list of matching doc ids. This list can contain 100k ids, which is why we're using scan/scroll.

What I've noticed is that the first time we perform the scan/scroll with a few filters, it takes ~25s to scroll over all the results (size=2k, 6 shards, query with 46k results). If we repeat the same query a second time, it takes the same time, probably because we have replica=1 and it's hitting the other copy of each shard. The third time it takes 4s.

Why this difference? Ok, there's the filter cache, but that is also created with a normal search (not scan) that takes milliseconds. And I'm using _source=false, therefore it shouldn't be hitting the disk to retrieve data, because I only need ids.

Related question: is it ok to use scan/scroll for this? I only need ids, can I switch to a normal _search?

Yet another question: can I turn off things that bloat the response like the index name and the score that are in every hit?

Thank you
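
For reference, the numbers above can be turned into a rough count of scroll round trips. This is only a sketch: it relies on the documented behaviour that with search_type=scan the size parameter applies per shard, and the function name `scan_batches` is made up for illustration.

```python
import math


def scan_batches(total_hits, size, primary_shards):
    """Estimate how many scroll batches a search_type=scan request needs.

    With scan, `size` is applied per shard, so a single batch can hold
    up to size * primary_shards hits.
    """
    per_batch = size * primary_shards
    return math.ceil(total_hits / per_batch), per_batch


# Figures from the post: 46k results, size=2k, 6 shards.
batches, per_batch = scan_batches(total_hits=46_000, size=2_000, primary_shards=6)
print(batches, per_batch)
```

Under these assumptions the 46k results fit in a handful of batches, so ~25s would mean several seconds per round trip rather than many small requests.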

warkolm · March 12, 2016, 8:44am

At a guess, cause I don't know for sure, it's the FS cache keeping the files in out-of-heap RAM.

nik9000 · March 12, 2016, 2:07pm

Probably fs caching like mark says. You can use response filtering if you want to remove bits of the response.
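
A minimal sketch of what response filtering could look like for this use case, assuming the `filter_path` query parameter is available on your version. The host, index name, and the helper `scan_url` are hypothetical; the snippet only builds the URL and does not contact a cluster.

```python
from urllib.parse import urlencode


def scan_url(base, index, scroll="1m", size=2000):
    """Build a scan/scroll search URL that asks ES to return only the
    scroll id, the total hit count, and the _id of each hit."""
    params = {
        "search_type": "scan",
        "scroll": scroll,
        "size": size,
        # Comma-separated list of response fields to keep.
        "filter_path": "_scroll_id,hits.total,hits.hits._id",
    }
    # Keep commas readable instead of percent-encoding them.
    return f"{base}/{index}/_search?" + urlencode(params, safe=",")


# Hypothetical host and index name.
print(scan_url("http://localhost:9200", "docs"))
```

With a filter like this, the index name, type, and score should no longer appear in each hit, which shrinks the response but does not change what the shards read from disk.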

jprante · March 12, 2016, 2:50pm

Exactly, one reason why a search result set construction may be slow is relevance scoring. In ES 2.x you can sort by _doc to switch off scoring while scrolling (see "Request body search" in the Elasticsearch Guide).
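
As a rough illustration of the suggestion above, a scroll request body for ES 2.x might look like the following. The helper name `ids_only_scroll_body` and the `status` field are assumptions for the example, not something from this thread.

```python
import json


def ids_only_scroll_body(filters, size=2000):
    """Build a request body for a scroll that only needs document ids."""
    return {
        "size": size,
        "_source": False,   # do not return the document source
        "sort": ["_doc"],   # index order, no relevance scoring
        "query": {"bool": {"filter": filters}},  # ES 2.x filter context
    }


# Hypothetical filter on a made-up field.
body = ids_only_scroll_body([{"term": {"status": "active"}}])
print(json.dumps(body, indent=2))
```

The body would be sent to `_search?scroll=1m` (or similar) in place of a scan request.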

vad · March 14, 2016, 10:35am

Isn't _doc useful only for "scroll only" requests? My guess was that it's not useful for scans, but I may be wrong (the docs are not clear about this, IMHO).

vad · March 14, 2016, 10:36am

Oh, response filtering, thank you!

vad · March 14, 2016, 10:38am

That's possible. I'll monitor disk I/O then. Thank you

vad · March 15, 2016, 10:24am

Back to your response... FS caching of what exactly? Not of the indexes, because a "normal" query runs in milliseconds, and that needs to read the whole inverted index too, at least for the two fields I'm stressing.

Thanks

warkolm · March 15, 2016, 8:04pm

The OS keeps frequently read files cached in free memory.

vad · March 15, 2016, 8:17pm

Ok, but shouldn't a search take the same amount of time? I mean, a scan/scroll does 2 things:

1. it filters the results, to find all possible matches, like a normal search
2. it returns the ids of every result, unlike a normal search

If the scan takes two orders of magnitude longer than a normal search, and point 1 is the same, it means point 2 is extremely slow. Since I use _source=false, does it mean it's reading the _source to get the _id?

Using fatrace I see there are a lot of reads on .fdt files (9k reads per shard, 6 shards, 13k results), which in my understanding don't store "field data" à la Elasticsearch but the _source, I guess. This would confirm my guess...

What about _uid? Can I ask that instead of _id? Would it be better?

Thank you in advance
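
For anyone who wants to repeat this measurement, here is a small sketch that tallies read events per file extension from fatrace output. It assumes lines shaped like `process(pid): TYPES /path`; the sample lines and paths are invented for illustration and are not real measurements. (`.fdt` is Lucene's stored-fields data file.)

```python
import re
from collections import Counter

# Assumed fatrace line shape: "<process>(<pid>): <event types> <path>"
LINE = re.compile(
    r"^(?P<proc>.+)\((?P<pid>\d+)\): (?P<types>[A-Z+<>]+) (?P<path>.+)$"
)


def count_reads_by_extension(lines):
    """Count events that include a read (R), grouped by file extension."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line.strip())
        if not m or "R" not in m.group("types"):
            continue  # skip unparsable lines and non-read events
        name = m.group("path").rsplit("/", 1)[-1]
        ext = name.rsplit(".", 1)[-1] if "." in name else ""
        counts[ext] += 1
    return counts


# Made-up sample lines, only to show the expected input shape.
sample = [
    "java(4242): R /data/nodes/0/indices/docs/0/index/_a.fdt",
    "java(4242): R /data/nodes/0/indices/docs/0/index/_a.fdt",
    "java(4242): O /data/nodes/0/indices/docs/0/index/_a.tim",
    "java(4242): RC /data/nodes/0/indices/docs/0/index/_a.fdx",
]
print(count_reads_by_extension(sample))
```

Feeding it the captured output of a fatrace run during a scan/scroll versus a normal search should make any difference in .fdt reads easy to compare.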

vad · April 4, 2016, 12:35pm

An update: using fincore I see that ES is reading .fdt files, even though I'm using _source=false in the scan/scroll. Isn't this a bug?

Thank you
