I've come across a problem when retrieving large numbers of records by ID from my Elasticsearch index. As background, my index contains about 42 million docs (about 250 GB in total) in 64 shards across 8 data nodes, plus a single client node to which my external transport clients connect. The data nodes have 32 GB of memory each (16 GB heap) and the client has 4 GB (all heap). I'm using Elasticsearch 2.2.0.
If I run a simple single-term query that finds a subset of 66k documents (after clearing caches), I can scan/scroll through them from a Java transport client in about 30 seconds, which is pretty decent. I use a scroll size of 1000 and a timeout of 60s. Each scan/scroll iteration takes about 300-400 ms to prepare the scroll and 2-5 ms to process the hits.
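For reference, this is roughly the loop I'm running. It's a minimal sketch rather than my real code: the index name, field name, and the `scrollAll` helper are placeholders of mine.

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.sort.SortOrder;

// Scroll through every hit for the given query, 1000 docs per page,
// keeping the scroll context alive for 60s between calls.
void scrollAll(Client client, QueryBuilder query) {
    SearchResponse resp = client.prepareSearch("my_index")
            .setQuery(query)
            .addSort("_doc", SortOrder.ASC) // cheapest sort; the 2.x replacement for scan
            .setSize(1000)
            .setScroll(TimeValue.timeValueSeconds(60))
            .get();

    while (resp.getHits().getHits().length > 0) {
        for (SearchHit hit : resp.getHits()) {
            // process hit (2-5 ms for the whole page in my case)
        }
        // This is the call whose latency I'm measuring below.
        resp = client.prepareSearchScroll(resp.getScrollId())
                .setScroll(TimeValue.timeValueSeconds(60))
                .get();
    }
}
```

The fast case is then simply `scrollAll(client, QueryBuilders.termQuery("my_field", "my_value"))`.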
If I instead take the 66k document IDs from this search set and run the same scan/scroll but with an IDs query, performance drops like a stone, with each successive prepareSearchScroll call taking 6-8s. As far as I can see, the more IDs you have, the worse the prepareSearchScroll performance gets: for 1k IDs the scroll takes a few hundred ms, for 10k it's 1.5s, and for 20k it's 3s. Because both the number of scrolls and the time per scroll grow, ID-based queries like this simply don't scale for me.
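The slow variant looks roughly like this (again a sketch: `loadIds()` stands in for however I collect the 66k IDs, and "my_type" is a placeholder for my mapping type):

```java
import java.util.List;
import org.elasticsearch.index.query.QueryBuilders;

// The slow case: the same scroll loop as above, driven by the
// 66k IDs instead of the term query.
List<String> ids = loadIds(); // the IDs found by the term query above
scrollAll(client, QueryBuilders.idsQuery("my_type")
        .addIds(ids.toArray(new String[ids.size()])));
```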
I know I can probably get better performance by splitting my big set of IDs into chunks and treating each chunk separately (something like the sketch below), but am I missing a trick here? My uneducated guess would be that these large queries aren't being cached - is this right? Is there something I can do?
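For concreteness, the chunking workaround I have in mind would look something like this (untested, and the chunk size is an arbitrary guess I'd have to benchmark):

```java
import java.util.List;
import org.elasticsearch.index.query.QueryBuilders;

// Run one scroll per chunk of IDs instead of one giant ids query.
int chunkSize = 1000; // arbitrary; would need tuning
for (int from = 0; from < ids.size(); from += chunkSize) {
    List<String> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
    scrollAll(client, QueryBuilders.idsQuery("my_type")
            .addIds(chunk.toArray(new String[chunk.size()])));
}
```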
Any words of wisdom are very welcome!