Scroll Questions

mooky · June 17, 2014, 5:46pm

Having hit a bunch of issues using scroll, I thought I had better improve my
understanding of how scroll is supposed to be used (and how it's not
supposed to be used).

1. Does it make sense to execute a search request with scroll, but
   SearchType != SCAN?
2. Does it make sense to execute a search request with scroll, and also
   with facets/aggregations?
3. What is the difference between scrolling to the end of the results
   (i.e. calling until hits.length == 0) and issuing a specific
   ClearScrollRequest? It appears to me that the ClearScrollRequest
   immediately clears the scroll, whereas there is some time delay before a
   scroll is cleaned up after reaching the end of the results. (I can see
   this in my tests because the ElasticsearchIntegrationTest fails on
   teardown unless I perform an explicit ClearScrollRequest or add a delay
   of some number of seconds.) From reading the docs, I am not sure if this
   is a bug or expected behaviour.
4. Does the scrollId represent the cursor, or the cursor page/iteration
   state? I have read documentation/mailing list explanations that have
   words to the effect "you must pass the scrollId from the previous
   response into the subsequent request" - which suggests the ID represents
   some cursor state, i.e. performing a scroll request with a given scrollId
   will always return the same results. My observation, however, is that the
   scrollId does not change (i.e. I get back the same scrollId I passed in),
   so each scroll request with the same scrollId advances the 'cursor' until
   no results are returned. I have also read posts on the mailing list that
   implied multiple calls could be made in parallel with the same scrollId
   to load all the results faster (which would imply the scrollId is not
   expected to change). So which is correct?

To explain the background for my questions, I have two requirements:

1. I get an update event that leads me to go find items in the index that
   need re-indexing. I perform a search on the index, I get the IDs and I
   load the original data from the source system(s) to reconstruct the
   document and index it. This seems to be exactly what SCAN and SCROLL are
   meant for. (However, the SCAN search type is different in that it always
   returns zero hits from the original search request - only the scroll
   requests seem to
2. The user normally performs a search, and naturally we limit how many
   results we serve to the client. However, occasionally, the user wants to
   return all the data for a given search/filter (say, to export to Excel or
   whatever), so it seems like a good idea to use scroll rather than
   paging through the results using from & size, as we know we will get
   consistent results even if documents are being added/removed/updated on
   the server.
   From a functionality perspective, I want to make sure the scrolling
   search request is the same as the non-scrolling search request so the
   user gets the same results. So from a code perspective, ideally I really
   want to make the code path the same (save for adding the scroll keepAlive
   param). However, perhaps there are things I perform with my normal search
   (e.g. aggregations, SearchType.DEFAULT, etc.) that just don't make sense
   when scrolling?

Many thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/80f173a7-07a0-4f72-a896-944223a3ac30%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mooky · June 17, 2014, 5:56pm

One more question I forgot:

Rather than looking at hits.length to know if the end of the scroll has
been reached, would it not be better to return a null scrollId when the end
of the cursor has been reached? On the surface it seems that would be
a) more intuitive
b) be the same regardless of which SearchType you are using
c) not be affected by the search itself returning zero results

Cheers.

jprante · June 17, 2014, 11:04pm

1. yes
2. facet/aggregations are not very useful while scrolling (I doubt they
   even work at all) because scrolling works on shard level and aggregations
   work on indices level
3. a scroll request takes resources. The purpose of ClearScrollRequest is
   to release those resources explicitly. This is indeed a rare situation
   when you need explicit clearing. The time delay of releasing scrolls
   implicitly can be controlled by the requests.
4. yes, the scroll id is an encoding of the combined state of all the
   shards that participate in the scroll. Even if the ID looks as if it has
   not changed, you should always use the latest reference to the scroll ID
   in the response, or you may clutter the nodes with unreleased scroll
   resources.
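
The advice about always using the latest scroll ID can be pictured with a minimal sketch in Python. It does not use the real Elasticsearch client: `FakeScrollClient` and its `search`/`scroll` methods are hypothetical in-memory stand-ins. The fake deliberately issues a new ID on every page (an exaggeration, since the real ID may look unchanged) so that reusing a stale ID fails loudly. The only point is the loop shape.

```python
import itertools


class FakeScrollClient:
    """Hypothetical in-memory stand-in for a scroll API (not the real client)."""

    def __init__(self, docs, page_size):
        self._docs = list(docs)
        self._page_size = page_size
        self._positions = {}  # scroll_id -> offset of the next page
        self._ids = itertools.count(1)

    def _new_id(self, offset):
        scroll_id = f"scroll-{next(self._ids)}"
        self._positions[scroll_id] = offset
        return scroll_id

    def search(self):
        """Open the scroll and return the first page plus a scroll ID."""
        hits = self._docs[: self._page_size]
        return {"scroll_id": self._new_id(len(hits)), "hits": hits}

    def scroll(self, scroll_id):
        """Return the next page; in this fake each ID is valid only once."""
        offset = self._positions.pop(scroll_id)  # KeyError on a stale ID
        hits = self._docs[offset : offset + self._page_size]
        return {"scroll_id": self._new_id(offset + len(hits)), "hits": hits}


def scroll_all(client):
    """Collect every hit, always passing on the most recent scroll ID."""
    response = client.search()
    results = []
    while response["hits"]:
        results.extend(response["hits"])
        # Use the ID from the latest response, never a cached older one.
        response = client.scroll(response["scroll_id"])
    return results


all_docs = scroll_all(FakeScrollClient(["a", "b", "c", "d", "e"], page_size=2))
print(all_docs)
```

Because the loop never stores the ID anywhere except the latest response, it behaves the same whether the server hands back an identical ID or a fresh one.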

Scrolling is very different from search, because there is shard-level
machinery that iterates over the Lucene segments and keeps them open. This
tends to tie up a lot of server-side resources, which may be long-lived - a
challenge for resource management. There is a reaper thread that wakes up
from time to time to take care of stray scroll searches. You observed this
as a "time delay". Ordinary search actions never keep resources open at
shard level.
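
The "time delay" can be pictured with a toy model in Python. This is not how Elasticsearch is implemented: `ToyScrollRegistry`, the 5-second keep-alive and the sweep at t=60 are made-up values for illustration. It only shows why a context that is no longer used can outlive its keep-alive until the next periodic sweep, while an explicit clear frees it at once.

```python
class ToyScrollRegistry:
    """Toy model of scroll contexts with a keep-alive and a periodic reaper."""

    def __init__(self):
        self._expires_at = {}  # scroll_id -> time after which it may be reaped

    def touch(self, scroll_id, now, keep_alive):
        """Open a context, or extend it when a scroll request arrives."""
        self._expires_at[scroll_id] = now + keep_alive

    def clear(self, scroll_id):
        """Explicit clear: release the context immediately."""
        self._expires_at.pop(scroll_id, None)

    def reap(self, now):
        """Periodic sweep: release only contexts whose keep-alive has passed."""
        expired = [sid for sid, t in self._expires_at.items() if t <= now]
        for sid in expired:
            del self._expires_at[sid]
        return expired

    def open_contexts(self):
        return sorted(self._expires_at)


registry = ToyScrollRegistry()
registry.touch("export", now=0, keep_alive=5)
registry.touch("reindex", now=0, keep_alive=5)

registry.clear("export")                       # freed at once
open_before_sweep = registry.open_contexts()   # "reindex" lingers until a sweep runs
reaped_by_sweep = registry.reap(now=60)        # the sweep finally releases it
print(open_before_sweep, reaped_by_sweep)
```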

Using scroll search for creating large CSV exports is adequate because this
iterates through the result set doc by doc. But by replacing a full-fledged
search that has facets/filters/aggregations/sorting with a scroll search,
you will only create large overheads (if it is even possible).

A null scroll ID is a matter of API design. By using hit length check for
0, you can use the same condition for other queries, so it is convenient
and not confusing. Null scroll IDs are always prone to NPEs.

Jörg

mooky · June 18, 2014, 9:28am

Many thanks Jörg.

Further questions/comments inline:

> yes

Thanks,

> facet/aggregations are not very useful while scrolling (I doubt they
> even work at all) because scrolling works on shard level and aggregations
> work on indices level

If they are not expected to work, would it make sense to either:

a) prevent aggregation/facet requests in conjunction with scroll
requests (ie give an error to the user)
b) simply not execute them?

If it doesn't make sense, would it be better to not return any
aggregation/facet results at all?

> a scroll request takes resources. The purpose of ClearScrollRequest is
> to release those resources explicitly. This is indeed a rare situation when
> you need explicit clearing. The time delay of releasing scrolls implicitly
> can be controlled by the requests.

Do you mean the keepAlive time? So, does the scroll (and its resources)
always remain for the duration of the keepAlive (since the last request on
that scroll) regardless of whether the end of the scroll was reached or not?

I read the following (from the documentation) to imply that reading to the
end of the scroll had the effect of "aborting" and therefore cleaning up
resources.

> Besides consuming the scroll search until no hits has been returned a
> scroll search can also be aborted by deleting the scroll_id

So, just to confirm, reading to the end of the results does nothing in
terms of bringing about the cleanup of the scroll? It's either the TTL or
the ClearScrollRequest that brings about the cleanup of resources.

Is there any downside to calling ClearScrollRequest explicitly?
(I am inclined to call it explicitly when the end of the scroll is reached
in order to clean up resources asap.)
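
The pattern I have in mind (clear as soon as I am done, whether I reached the end or hit an error) might look like the sketch below. It is written in Python against a hypothetical `FakeScrollClient`; the method names `search`, `scroll` and `clear_scroll` are stand-ins, not the real client API, and the fake assumes that reaching the end does not release the context by itself.

```python
class FakeScrollClient:
    """Hypothetical stand-in that tracks which scroll contexts are open."""

    def __init__(self, pages):
        self._remaining = list(pages)
        self.open_scrolls = set()

    def _next_page(self):
        hits = self._remaining.pop(0) if self._remaining else []
        return {"scroll_id": "scroll-1", "hits": hits}

    def search(self):
        self.open_scrolls.add("scroll-1")
        return self._next_page()

    def scroll(self, scroll_id):
        if scroll_id not in self.open_scrolls:
            raise KeyError(scroll_id)
        # Assumption: an exhausted scroll stays open until cleared or expired.
        return self._next_page()

    def clear_scroll(self, scroll_id):
        self.open_scrolls.discard(scroll_id)


def export_all(client, handle):
    """Pass every hit to `handle`, then release the scroll explicitly."""
    response = client.search()
    scroll_id = response["scroll_id"]
    try:
        while response["hits"]:
            for hit in response["hits"]:
                handle(hit)
            response = client.scroll(scroll_id)
            scroll_id = response["scroll_id"]
    finally:
        # Runs on normal completion and on errors alike.
        client.clear_scroll(scroll_id)


exported = []
client = FakeScrollClient([["a", "b"], ["c"]])
export_all(client, exported.append)
print(exported, client.open_scrolls)
```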

> yes, the scroll id is an encoding of the combined state of all the
> shards that participate in the scroll. Even if the ID looks as if it has
> not changed, you should always use the latest reference to the scroll ID in
> the response, or you may clutter the nodes with unreleased scroll resources.

Thanks for the explanation.

> A null scroll ID is a matter of API design. By using hit length check for
> 0, you can use the same condition for other queries, so it is convenient
> and not confusing. Null scroll IDs are always prone to NPEs.

Agreed. It's a matter of API style/design.
The only issue I have with checking hits.length is that, depending on the
SearchType, sometimes hits.length==0 does not mean the end of the results
(e.g. SearchType.SCAN). It's the lack of consistency that bothers me about
it. It requires the code that handles results to be aware of a detail of
the request.

My case for using scrollId is that:
The scrollId is already null if no scroll is requested.
For this reason, (IMO) scrollId==null would be a more consistent indicator
of no scrolling required - or no further scrolling required. Also it would
reinforce the notion that the user should always use/observe the returned
scrollId - they would have to.
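
To make the inconsistency concrete, here is a small Python sketch of the check the result-handling code ends up needing. It assumes, as described earlier in this thread, that a SCAN search returns no hits on the initial response. The `"scan"`/`"default"` labels and the `reached_end`/`drain` helpers are hypothetical names, and pages are plain lists rather than real responses.

```python
def reached_end(hits, is_initial_response, search_type):
    """Decide whether an empty page really means the scroll is exhausted."""
    if search_type == "scan" and is_initial_response:
        return False  # empty by design, keep scrolling
    return len(hits) == 0


def drain(pages, search_type):
    """Walk a list of pages, where the first entry is the initial response."""
    collected = []
    for index, hits in enumerate(pages):
        if reached_end(hits, index == 0, search_type):
            break
        collected.extend(hits)
    return collected


scan_result = drain([[], ["a", "b"], ["c"], []], search_type="scan")
default_result = drain([["a", "b"], ["c"], []], search_type="default")
print(scan_result, default_result)
```

The handler has to be told the search type just to interpret an empty page, which is exactly the coupling I would like to avoid.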

Cheers,
-Nick

mooky · June 18, 2014, 10:19am

Furthermore, on using hits.length == 0:
shard failure(s) can mean hits.length == 0 but perhaps not the end of the
scroll.
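
A defensive check along these lines could be sketched in Python as follows. The response is modelled as a plain dict with a hypothetical `failed_shards` count next to `hits`; the real response layout may differ, and whether to raise, retry or carry on is a design choice this sketch does not settle.

```python
class ScrollIncomplete(Exception):
    """Raised when a page cannot be trusted because shards failed."""


def check_page(response):
    """Return the hits, or raise if any shard failed for this page."""
    failed = response["failed_shards"]  # hypothetical field name
    if failed > 0:
        raise ScrollIncomplete(f"{failed} shard(s) failed")
    return response["hits"]


end_of_scroll = check_page({"hits": [], "failed_shards": 0})
print(end_of_scroll)
```

Only an empty page with zero failed shards is then treated as the end of the scroll.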

mooky · June 19, 2014, 12:55pm

Further to (2): would it be an improvement to have a different kind of
request for a scrolling search? That way the API could exclude items that
don't make sense (e.g. aggregations, facets, etc.).

On Wednesday, 18 June 2014 10:28:06 UTC+1, mooky wrote:

Many thanks Jörg.

Further questions/comments inline:

yes

Thanks,

facet/aggregations are not very useful while scrolling (I doubt they

even work at all) because scrolling works on shard level and aggregations
work on indices level

If they are not expected to work, would it make sense to either:

prevent aggregation/facet requests in conjunction with scroll
requests (ie give an error to the user)

Simply not execute them?

If it doesn't make sense, would it be better to not return any
aggregation/facet results at all?

a scroll request takes resources. The purpose of ClearScrollRequest is

to release those resources explicitly. This is indeed a rare situation when
you need explicit clearing. The time delay of releasing scrolls implicitly
can be controlled by the requests.

Do you mean the keepAlive time? So, does the scroll (and its resources)
always remain for the duration of the keepAlive (since the last request on
that scroll) regardless of whether the end of the scroll was reached or not?

I read the following (from the documentation) to imply that reading to the
end of the scroll had the effect of "aborting" and therefore cleaning up
resources.

Besides consuming the scroll search until no hits has been returned a
scroll search can also be aborted by deleting the scroll_id

So, just to confirm, reading to the end of the results does nothing in
terms of bringing about the cleanup of the scroll? Its either the TTL or
the ClearScrollRequest that brings about the cleanup of resources.

Is there any downside to calling ClearScrollRequest explicitly?
(I am inclined to call it explicitly when the end of the scroll is reached
in order clean up resources asap)

yes, the scroll id is an encoding of the combined state of all the

shards that participate in the scroll. Even if the ID looks as if it has
not changed, you should always use the latest reference to the scroll ID in
the response, or you may clutter the nodes with unreleased scroll resources.

Thanks for the explanation.

A null scroll ID is a matter of API design. By using hit length check for

0, you can use the same condition for other queries, so it is convenient
and not confusing. Null scroll IDs are always prone to NPEs.

Agreed. Its a matter of API style/design.
The only issue I have with checking hits.length is that depending on the
SearchType, sometimes hits.length==0 does not mean the end of the results
(e.g. SearchType.SCAN). Its the lack of consistency that bothers me about
it. It requires the code that handles results to be aware of a detail of
the request.

My case for using scrollId is that:
The scrollId is already null if no scroll is requested.
For this reason, (IMO) scrollId==null would be a more consistent indicator
of no scrolling required - or no further scrolling required. Also it would
re-enforce the notion that the user should always use/observe the returned
scrollId - they would have to.

Cheers,
-Nick

On Wednesday, 18 June 2014 00:04:06 UTC+1, Jörg Prante wrote:

yes

facet/aggregations are not very useful while scrolling (I doubt they
even work at all) because scrolling works on shard level and aggregations
work on indices level

a scroll request takes resources. The purpose of ClearScrollRequest is
to release those resources explicitly. This is indeed a rare situation when
you need explicit clearing. The time delay of releasing scrolls implicitly
can be controlled by the requests.

yes, the scroll id is an encoding of the combined state of all the
shards that participate in the scroll. Even if the ID looks as if it has
not changed, you should always use the latest reference to the scroll ID in
the response, or you may clutter the nodes with unreleased scroll resources.

Scrolling is very different from search, because there is a shard-level
machinery that iterates over the Lucene segments and keep them open. This
tends to ramp up lots of server-side resources, which may long-lived - a
challenge for resource management. There is a reaper thread that wakes up
from time to time to take care of stray scroll searches. You observed this
as a "time delay". Ordinary search actions never keep resources open at
shard level.

Using scroll search for creating large CSV exports is adequate because
this iterates through the result set doc by doc. But replacing a
full-fledged search that has facets/filters/aggregations/sorting with a
scroll search, you will only create large overheads (if it is even
possible).

A null scroll ID is a matter of API design. By using hit length check for
0, you can use the same condition for other queries, so it is convenient
and not confusing. Null scroll IDs are always prone to NPEs.

Jörg



afaraone · June 23, 2016, 4:27pm

I'm interested in clarifying question number one. Why does it make sense to specify scroll without searchType=scan?

Is it to preserve sort order, or some other part of the search?
