Scan and scroll performance with IDs query

I've come across a problem when retrieving large numbers of records by ID from my elasticsearch index. As background, my index contains about 42 million docs (total about 250Gb) in 64 shards across 8 data nodes and a single client node to which my external transport clients connect. The data nodes have 32 Gb memory each (16Gb heap) and the client has 4Gb (all heap). I'm using ElasticSearch 2.2.0.

If I run a simple single term query that finds a subset of 66k documents (after clearing caches), I can scan/scroll from a Java transport client in about 30 seconds, which is pretty decent. I use a scroll size of 1000 and timeout of 60s. Each scan/scroll iteration takes about 300-400 ms to prepare the scroll and 2-5ms to process the hits.

If I instead take the 66k document IDs for this search set and run the same scan/scroll but with an IDs query, performance drops like a stone, with each successive prepareSearchScroll hitting 6-8s. As far as I can see, the more IDs you have, the worse the prepareSearchScroll performance gets - for 1k, the scroll is a few 100 ms, for 10k, its 1.5s, for 20k its 3s. Because the number of scrolls and the length of time for each scroll increases, ID-based queries like this simply don't scale for me.

I know I can probably get better performance by splitting my big set of IDs into chunks and treating each chunk separately, but am I missing a trick here? My uneducated guess would that these large query strings aren't being cached - is this right? Is there something I can do.

Any words of wisdom very welcome!

You should ensure to disable sorting by adding a sort clause on _doc https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html

thanks, will give that a whirl! I'll admit this rings a bell from somewhere, but never connected the two.

Sorting by _doc is AFAIK the ES 2.x equilvalent of using the scan scroll type in ES 1.x.

In our own testing sorting by _doc gave at least 5x and usually better improvement :slight_smile:

yes, that made a big difference! I find it slightly counterintuitive to have to specify a sort in order to not sort. I'm also puzzled why I saw this with my big terms list and not the single term query to get the same size set, but this may reflect index order I guess.

Looking further, its still not right - when trying this I'd inadvertently increased the scroll size to 60k at the same time, which was fine for a small set of fields but obviously caused problems with larger numbers of documents. With _doc, and the original scroll of 1k, I don't see any improvement, and if anything, its slightly worse :frowning: