I'm trying to reindex documents of widely variable size (from a few bytes
to a few outliers over 200 MB). I want to keep my scroll size as high as
possible to maximize throughput, but I need to keep the returned result set
under a memory threshold, so when those outliers show up I want the scroll
size temporarily tapered. Is that possible? Could it be implemented in the
future? I discovered that a scroll cannot be "retried" if it comes back too
large - once some documents are returned, they can never be returned again
for the same scroll - but in a 2011 comment Shay alluded it might be doable:
Shay also notes that it opens a can of worms. Each node would have to keep
extended state for the scroll, with two continuations: the current one for
replay, and the next one. It would also mean that if even a single node
can't respond successfully (for whatever reason), all the other nodes would
have to replay their old responses. During replay, those nodes could in
turn return failed scroll responses, and the whole scan/scroll could end up
in a retry loop if the client keeps trying.
With the current scan/scroll, each node can release the resources allocated
for a scroll response immediately after returning it to the client, and
happily continue.
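
Purely to illustrate the bookkeeping described above (this is not ES code;
all names are made up), the extended per-node state would look something
like this:

    import java.util.List;

    // Illustrative only: what each node would have to retain to make a
    // scroll "retryable". Today the page can be freed as soon as it is
    // sent; with replay it must be held until the client advances.
    final class RetryableScrollState<D> {
        private List<D> lastPage; // kept around in case the client asks for a replay
        private long nextCursor;  // continuation pointing at the page after lastPage

        List<D> replay() {
            return lastPage; // re-send the current page unchanged
        }

        List<D> advance(List<D> nextPage, long cursor) {
            lastPage = nextPage; // only now may the previous page be released
            nextCursor = cursor;
            return nextPage;
        }
    }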
That wasn't my question, though, just a side note; maybe I shouldn't have
included it if it was going to distract. I'm asking whether there is a way
to set the scroll size in bytes instead of documents.
Sure, you could write an alternative implementation of
org.elasticsearch.search.fetch.FetchSearchResult that stops fetching search
results once they exceed a limit. Because ES shards do not know the byte
size of the final result the client sees, you would have to declare an
internal estimated byte limit per shard.
There is an edge case where a byte limit rather than a document limit
doesn't help much, since even a single doc could take gigabytes.
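
To make that concrete, here is a rough sketch of such a per-shard,
byte-budgeted fetch loop. Everything in it is hypothetical (documents
modeled as raw byte arrays, names made up), not actual Elasticsearch
internals; note how it always returns at least one document, so the scroll
makes progress even past a doc that alone exceeds the budget:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Hypothetical per-shard fetch with a byte budget instead of a doc count.
    final class ByteBudgetedFetch {

        // Returns docs until adding the next one would exceed the budget.
        static List<byte[]> nextPage(Deque<byte[]> shardDocs, long byteLimit) {
            List<byte[]> page = new ArrayList<byte[]>();
            long used = 0;
            while (!shardDocs.isEmpty()) {
                byte[] doc = shardDocs.peekFirst();
                // The edge case above: a single oversized doc must still be
                // returned (alone), or the scroll would stall.
                if (!page.isEmpty() && used + doc.length > byteLimit) {
                    break; // leave it for the next page
                }
                shardDocs.pollFirst();
                page.add(doc);
                used += doc.length;
            }
            return page;
        }
    }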
All in all, I am not sure how much benefit this brings compared to running
a scan/scroll over the huge docs with the scroll size set to the minimum of 1.
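
For reference, that size-1 baseline is only a few lines with the 1.x-era
Java client. A minimal sketch, assuming an existing Client instance and an
index named "docs" (both illustrative):

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;

    public class ScanScrollSizeOne {
        public static void scanAll(Client client) {
            // SCAN returns no hits in the first response; it only opens the cursor.
            SearchResponse resp = client.prepareSearch("docs")
                    .setSearchType(SearchType.SCAN)
                    .setScroll(TimeValue.timeValueMinutes(1))
                    .setQuery(QueryBuilders.matchAllQuery())
                    .setSize(1) // one document per shard per round trip
                    .execute().actionGet();
            while (true) {
                resp = client.prepareSearchScroll(resp.getScrollId())
                        .setScroll(TimeValue.timeValueMinutes(1))
                        .execute().actionGet();
                if (resp.getHits().getHits().length == 0) {
                    break; // scroll exhausted
                }
                for (SearchHit hit : resp.getHits().getHits()) {
                    System.out.println(hit.getId()); // reindex/process here
                }
            }
        }
    }

Keep in mind that the scroll size applies per shard, so even at size 1 a
single round trip can return as many documents as there are shards.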