I've come across a problem when retrieving large numbers of records by ID from my Elasticsearch index. As background, my index contains about 42 million docs (about 250 GB in total) in 64 shards across 8 data nodes, plus a single client node to which my external transport clients connect. The data nodes have 32 GB of memory each (16 GB heap) and the client has 4 GB (all heap). I'm using Elasticsearch 2.2.0.
If I run a simple single-term query that finds a subset of 66k documents (after clearing caches), I can scan/scroll through them from a Java transport client in about 30 seconds, which is pretty decent. I use a scroll size of 1000 and a timeout of 60s. Each scan/scroll iteration takes about 300-400 ms to prepare the scroll and 2-5 ms to process the hits.
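For reference, this is roughly the loop I'm running. It's a minimal sketch rather than my real code: the index name, field name, and the `scrollAll` helper are placeholders of mine.

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.sort.SortOrder;

// Scroll through every hit for the given query, 1000 docs per page,
// keeping the scroll context alive for 60s between calls.
void scrollAll(Client client, QueryBuilder query) {
    SearchResponse resp = client.prepareSearch("my_index")
            .setQuery(query)
            .addSort("_doc", SortOrder.ASC) // cheapest sort; the 2.x replacement for scan
            .setSize(1000)
            .setScroll(TimeValue.timeValueSeconds(60))
            .get();

    while (resp.getHits().getHits().length > 0) {
        for (SearchHit hit : resp.getHits()) {
            // process hit (2-5 ms for the whole page in my case)
        }
        // This is the call whose latency I'm measuring below.
        resp = client.prepareSearchScroll(resp.getScrollId())
                .setScroll(TimeValue.timeValueSeconds(60))
                .get();
    }
}
```

The fast case is then simply `scrollAll(client, QueryBuilders.termQuery("my_field", "my_value"))`.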
If I instead take the 66k document IDs from this search set and run the same scan/scroll but with an IDs query, performance drops like a stone, with each successive prepareSearchScroll call taking 6-8s. As far as I can see, the more IDs you have, the worse the prepareSearchScroll performance gets: for 1k IDs the scroll takes a few hundred ms, for 10k it's 1.5s, and for 20k it's 3s. Because both the number of scrolls and the time per scroll grow, ID-based queries like this simply don't scale for me.
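The slow variant looks roughly like this (again a sketch: `loadIds()` stands in for however I collect the 66k IDs, and "my_type" is a placeholder for my mapping type):

```java
import java.util.List;
import org.elasticsearch.index.query.QueryBuilders;

// The slow case: the same scroll loop as above, driven by the
// 66k IDs instead of the term query.
List<String> ids = loadIds(); // the IDs found by the term query above
scrollAll(client, QueryBuilders.idsQuery("my_type")
        .addIds(ids.toArray(new String[ids.size()])));
```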
I know I can probably get better performance by splitting my big set of IDs into chunks and treating each chunk separately (something like the sketch below), but am I missing a trick here? My uneducated guess would be that these large queries aren't being cached - is this right? Is there something I can do?
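For concreteness, the chunking workaround I have in mind would look something like this (untested, and the chunk size is an arbitrary guess I'd have to benchmark):

```java
import java.util.List;
import org.elasticsearch.index.query.QueryBuilders;

// Run one scroll per chunk of IDs instead of one giant ids query.
int chunkSize = 1000; // arbitrary; would need tuning
for (int from = 0; from < ids.size(); from += chunkSize) {
    List<String> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
    scrollAll(client, QueryBuilders.idsQuery("my_type")
            .addIds(chunk.toArray(new String[chunk.size()])));
}
```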
Any words of wisdom are very welcome!