Scrolling & Resource usage (lack of close()..)

Hey all,

Scrolling keeps resources on the nodes around for the time limit specified
in the scroll request, and as far as I can see from a high level, the
client has no way to close() or 'release' those resources once it is
finished. A close reading of the docs and code seems to indicate, though,
that if a scrolling client iterates over the entire collection of results,
the resources for the scroll are automatically released once that loop
exits.

Is that true?

We have a case where an API-style request in our application needs to
return all results. We'd like to use ES to perform the search internally
and scroll through all the results to satisfy the API answer, but the lack
of clarity about closing/freeing up the resources makes it unclear whether
this is a good idea.

Anyone have any good info on this?

cheers,

Paul

Hi Paul

> Scrolling keeps resources on the nodes around for the time limit
> specified in the scroll request, and as far as I can see from a high
> level, the client has no way to close() or 'release' those resources
> once it is finished. A close reading of the docs and code seems to
> indicate, though, that if a scrolling client iterates over the entire
> collection of results, the resources for the scroll are automatically
> released once that loop exits.
>
> Is that true?

That is correct. The "view" on the index taken when you start your scroll
request is maintained until either (1) the scroll request finishes or
(2) the scroll times out.

Note: on the initial scroll request and on EACH SUBSEQUENT pull of
scroll results, you pass a scroll=TIME parameter. This means that the
timeout should be sufficient to process the results from a single pull,
not all the results in ES. Every time you pull another set of results
from the scroll, it extends the timeout to now() + timeout.

So let's say you are sure that you can finish parsing each batch of X
results in 20 seconds: set your scroll timeout to e.g. '30s'. Every time
you pull the next batch of results, you pass scroll=30s, which ensures
that the scroll stays live for another 30s from now.
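
A rough sketch of that loop in Python, in case it helps. This is only an illustration under some assumptions: fetch_page is a hypothetical callable standing in for whatever HTTP client you use to pull the next batch from the scroll, and fake_fetch_factory is a made-up stand-in so the loop can be run without a cluster. The response shape (a "_scroll_id" plus "hits" -> "hits") follows what ES returns for scroll requests.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Page = Dict[str, Any]


def scroll_all(fetch_page: Callable[[str, str], Page],
               scroll_id: str,
               keep_alive: str = "30s") -> Iterator[dict]:
    """Yield every hit, passing the scroll timeout again on each pull."""
    while True:
        # keep_alive is sent on EVERY pull, so it only has to cover the
        # time needed to process one batch, not the whole result set.
        page = fetch_page(scroll_id, keep_alive)
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty batch means the scroll is exhausted
        for hit in hits:
            yield hit
        # Continue with the scroll id from the most recent response.
        scroll_id = page.get("_scroll_id", scroll_id)


def fake_fetch_factory(batches: List[List[dict]],
                       calls: List[Tuple[str, str]]) -> Callable[[str, str], Page]:
    """Hypothetical in-memory stand-in for the real HTTP call (for demos only)."""
    pending = list(batches)

    def fetch(scroll_id: str, keep_alive: str) -> Page:
        calls.append((scroll_id, keep_alive))
        hits = pending.pop(0) if pending else []
        return {"_scroll_id": "id%d" % len(calls), "hits": {"hits": hits}}

    return fetch
```

With a real client, fetch_page would send the scroll id and scroll=30s to the scroll endpoint and return the parsed JSON body; everything else stays the same.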

> We have a case where an API-style request in our application needs to
> return all results. We'd like to use ES to perform the search
> internally and scroll through all the results to satisfy the API
> answer, but the lack of clarity about closing/freeing up the resources
> makes it unclear whether this is a good idea.

Scrolling can be very expensive. To return all the results in ES, it is
better to combine scrolling with search_type=scan, which is very
efficient. The downside is that you can't sort with scan requests.
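
For illustration, here is a small Python helper that builds the URL for the initial scan request. The host, index name, and size value are placeholders I made up; the search_type, scroll, and size query parameters are the ones discussed above. Treat it as a sketch rather than a recommended client.

```python
from urllib.parse import urlencode


def scan_search_url(base_url: str, index: str,
                    keep_alive: str = "30s", size: int = 100) -> str:
    """Build the URL for an initial search_type=scan request.

    base_url and index are placeholders for your own cluster and index.
    """
    params = [
        ("search_type", "scan"),  # unsorted, efficient "return everything" mode
        ("scroll", keep_alive),   # how long to keep the scroll alive per pull
        ("size", size),           # batch size for each pull
    ]
    return "%s/%s/_search?%s" % (base_url.rstrip("/"), index, urlencode(params))
```

For example, scan_search_url("http://localhost:9200", "myindex") gives a URL ending in /myindex/_search?search_type=scan&scroll=30s&size=100. The query itself goes in the request body as usual.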

clint

Thanks. Yeah, I know getting all results is expensive, but our API is what
it is now. Our old search framework supported this, and it was expensive
compared to a traditional top-X result, which Lucene is great at.

Is a scroll better or worse than paging through successive searches all
the way to the very last page? I'd expect scroll to be better than that.

(In reply to Clinton Gormley's message of Friday, 20 July 2012.)