Many thanks Jörg.
Further questions/comments inline:
> - yes
Thanks,
> - facet/aggregations are not very useful while scrolling (I doubt they
> even work at all) because scrolling works on shard level and aggregations
> work on indices level
If they are not expected to work, would it make sense to either:
- reject aggregation/facet requests made in conjunction with scroll
requests (i.e. return an error to the user), or
- simply not execute them?
If neither makes sense, would it be better not to return any
aggregation/facet results at all?
> - a scroll request takes resources. The purpose of ClearScrollRequest is
> to release those resources explicitly. This is indeed a rare situation when
> you need explicit clearing. The time delay of releasing scrolls implicitly
> can be controlled by the requests.
Do you mean the keepAlive time? So, does the scroll (and its resources)
always remain for the duration of the keepAlive (since the last request on
that scroll) regardless of whether the end of the scroll was reached or not?
I read the following (from the documentation) to imply that reading to the
end of the scroll had the effect of "aborting" and therefore cleaning up
resources:
> Besides consuming the scroll search until no hits has been returned a
> scroll search can also be aborted by deleting the scroll_id
So, just to confirm, reading to the end of the results does nothing in
terms of bringing about the cleanup of the scroll? It's either the TTL or
the ClearScrollRequest that brings about the cleanup of resources?
Is there any downside to calling ClearScrollRequest explicitly?
(I am inclined to call it explicitly when the end of the scroll is reached,
in order to clean up resources as soon as possible.)
> - yes, the scroll id is an encoding of the combined state of all the
> shards that participate in the scroll. Even if the ID looks as if it has
> not changed, you should always use the latest reference to the scroll ID in
> the response, or you may clutter the nodes with unreleased scroll resources.
Thanks for the explanation.
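To make sure I have understood the pattern, here is a minimal sketch of it in Python. It runs against an in-memory stand-in, not the real client: `FakeScrollClient`, `scroll_all` and the response shape (`scroll_id`, `hits`) are all hypothetical names invented for illustration. It assumes a non-SCAN search, where the first response already carries hits. The two points it tries to capture are: always carry forward the scroll id from the latest response, and clear the scroll explicitly rather than waiting for the keepAlive to expire.

```python
class FakeScrollClient:
    """In-memory stand-in for a search client; NOT the Elasticsearch API."""

    def __init__(self, docs, page_size):
        self._docs = list(docs)
        self._page_size = page_size
        self._cursors = {}  # scroll_id -> offset of the next unread document
        self._next_id = 0

    def search(self):
        """Open a scroll and return the first page of hits."""
        self._next_id += 1
        scroll_id = f"scroll-{self._next_id}"
        self._cursors[scroll_id] = 0
        return self.scroll(scroll_id)

    def scroll(self, scroll_id):
        """Return the next page plus the scroll id to use for the next call."""
        offset = self._cursors[scroll_id]
        hits = self._docs[offset:offset + self._page_size]
        self._cursors[scroll_id] = offset + len(hits)
        return {"scroll_id": scroll_id, "hits": hits}

    def clear_scroll(self, scroll_id):
        """Release the resources held for this scroll."""
        self._cursors.pop(scroll_id, None)

    def open_scrolls(self):
        """Number of scrolls that still hold resources."""
        return len(self._cursors)


def scroll_all(client):
    """Yield every hit, always reusing the most recent scroll id."""
    response = client.search()
    scroll_id = response["scroll_id"]
    try:
        while response["hits"]:
            yield from response["hits"]
            response = client.scroll(scroll_id)
            scroll_id = response["scroll_id"]  # never reuse a stale id
    finally:
        # Explicit cleanup instead of waiting for the keep-alive to expire.
        client.clear_scroll(scroll_id)
```

Because the cleanup sits in a `finally` block, the scroll is also released if the consumer stops iterating early or an exception is raised part-way through.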
> A null scroll ID is a matter of API design. By using hit length check for
> 0, you can use the same condition for other queries, so it is convenient
> and not confusing. Null scroll IDs are always prone to NPEs.
Agreed. It's a matter of API style/design.
The only issue I have with checking hits.length is that, depending on the
SearchType, hits.length==0 sometimes does not mean the end of the results
(e.g. SearchType.SCAN). It's the lack of consistency that bothers me about
it. It requires the code that handles results to be aware of a detail of
the request.
My case for using scrollId is that the scrollId is already null if no
scroll is requested.
For this reason, (IMO) scrollId==null would be a more consistent indicator
of no scrolling required - or no further scrolling required. It would also
reinforce the notion that the user should always use/observe the returned
scrollId - they would have to.
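In the meantime, one way to keep that request detail out of the result-handling code is to hide it in a small helper. The sketch below is in Python and only illustrative: `iter_pages`, `make_fetcher`, the `first_page_may_be_empty` flag and the response shape are hypothetical names, not part of any real client. The flag models a SCAN-style request, whose initial response carries a scroll id but no hits.

```python
def iter_pages(first_response, fetch_next, first_page_may_be_empty=False):
    """Yield non-empty pages of hits until the scroll is exhausted.

    fetch_next takes the latest scroll id and returns the next response.
    first_page_may_be_empty models a SCAN-style request, whose initial
    response has a scroll id but zero hits.
    """
    response = first_response
    if first_page_may_be_empty and not response["hits"]:
        # The empty first page is not the end of the results; fetch one more.
        response = fetch_next(response["scroll_id"])
    while response["hits"]:
        yield response["hits"]
        response = fetch_next(response["scroll_id"])


def make_fetcher(pages):
    """Hypothetical stand-in for a scroll request: replays canned responses."""
    remaining = iter(pages)
    return lambda scroll_id: next(remaining)
```

The caller states once, at request time, whether the first page may be empty; everything downstream just consumes pages and never inspects the search type.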
Cheers,
-Nick
On Wednesday, 18 June 2014 00:04:06 UTC+1, Jörg Prante wrote:
- yes
- facet/aggregations are not very useful while scrolling (I doubt they
  even work at all) because scrolling works on shard level and aggregations
  work on indices level
- a scroll request takes resources. The purpose of ClearScrollRequest is
  to release those resources explicitly. This is indeed a rare situation when
  you need explicit clearing. The time delay of releasing scrolls implicitly
  can be controlled by the requests.
- yes, the scroll id is an encoding of the combined state of all the
  shards that participate in the scroll. Even if the ID looks as if it has
  not changed, you should always use the latest reference to the scroll ID in
  the response, or you may clutter the nodes with unreleased scroll resources.
Scrolling is very different from search, because there is a shard-level
machinery that iterates over the Lucene segments and keeps them open. This
tends to ramp up lots of server-side resources, which may be long-lived - a
challenge for resource management. There is a reaper thread that wakes up
from time to time to take care of stray scroll searches. You observed this
as a "time delay". Ordinary search actions never keep resources open at
shard level.
Using scroll search for creating large CSV exports is adequate because
this iterates through the result set doc by doc. But if you replace a
full-fledged search that has facets/filters/aggregations/sorting with a
scroll search, you will only create large overheads (if it is even
possible).
A null scroll ID is a matter of API design. By using hit length check for
0, you can use the same condition for other queries, so it is convenient
and not confusing. Null scroll IDs are always prone to NPEs.
Jörg
On Tue, Jun 17, 2014 at 7:46 PM, mooky <nick.mi...@gmail.com> wrote:
Having hit a bunch of issues using scroll, I thought I had better improve my
understanding of how scroll is supposed to be used (and how it's not
supposed to be used).
- Does it make sense to execute a search request with scroll, but
SearchType != SCAN?
- Does it make sense to execute a search request with scroll, and
also with facet/aggregations?
- What is the difference between scrolling to the end of the results
(ie calling until hits.length == 0) and issuing a specific
ClearScrollRequest? It appears to me that the ClearScrollRequest
immediately clears the scroll - whereas there is some time delay before a
scroll is cleaned up after reaching the end of the results. (I can see
this in my tests because the ElasticsearchIntegrationTest fails on teardown
unless I perform an explicit ClearScrollRequest or I put a delay of some
number of seconds). From reading the docs, I am not sure if this is a bug or
expected behaviour.
- Does the scrollId represent the cursor, or the cursor
page/iteration state? I have read documentation/mailing list explanations
that have words to the effect "you must pass the scrollId from the previous
response into the subsequent request" - which suggests the id represents
some cursor state - ie performing a scroll request with a given scrollId
will always return the same results. My observation, however, is that the
scrollId does not change (ie I get back the same scrollId I passed in) so
each scroll request with the same scrollId advances the 'cursor' until no
results are returned. I have also read stuff on the mailing list that
implied multiple calls could be made in parallel with the same scrollId to
load all the results faster (which would imply the scrollId is not expected
to change). So which is correct?
To explain the background for my questions: I have two requirements:
- I get an update event that leads me to go find items in the index that
  need re-indexing. I perform a search on the index, I get the IDs and I
  load the original data from the source system(s) to reconstruct the
  document and index it. This seems to be exactly what SCAN and SCROLL are
  meant for. (However, the SCAN search type is different in that it always
  returns zero hits from the original search request - only the scroll
  requests seem to return hits.)
- The user normally performs a search, and naturally we limit how many
  results we serve to the client. However, occasionally, the user wants to
  return all the data for a given search/filter (say, to export to Excel or
  whatever), so it seems like a good idea to use the scroll rather than
  paging through the results using from&size, as we know we will get
  consistent results even if documents are being added/removed/updated on the
  server.
From a functionality perspective, I want to make sure the scrolling
search request is the same as the non-scrolling search request so the user
gets the same results - so from a code perspective, ideally I really want
to make the codepath the same (save for adding the scroll keepAlive param).
However, perhaps there are things I perform with my normal search (e.g.
aggregations, SearchType.DEFAULT, etc) that just don't make sense when
scrolling?
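One way to picture the "same codepath" idea is a small function that derives the scrolling request from the normal one. This is only a sketch in Python under stated assumptions: the request is modelled as a plain dict, `to_scroll_request` and the `index`/`body`/`scroll` layout are hypothetical, and dropping the aggregation/facet sections is an assumption about what makes sense when scrolling, not documented behaviour.

```python
import copy

# Request-body sections that hold facets/aggregations.
AGGREGATION_KEYS = ("aggs", "aggregations", "facets")


def to_scroll_request(search_request, keep_alive="1m"):
    """Derive a scrolling request from a normal search request.

    The dict layout here is hypothetical, chosen only for illustration.
    """
    request = copy.deepcopy(search_request)  # leave the caller's request untouched
    body = request.get("body", {})
    for key in AGGREGATION_KEYS:
        body.pop(key, None)  # assumption: aggregations are dropped when scrolling
    request["scroll"] = keep_alive  # the only addition: the scroll keep-alive
    return request
```

The query and filters are copied unchanged, so both requests select the same documents; only the parts that are specific to paging and summarising differ.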
Many thanks.
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/80f173a7-07a0-4f72-a896-944223a3ac30%40googlegroups.com
For more options, visit https://groups.google.com/d/optout.