Scroll Questions

Having hit a bunch of issues using scroll, I thought I had better improve my
understanding of how scroll is supposed to be used (and how it's not
supposed to be used).

  1. Does it make sense to execute a search request with scroll, but
    SearchType != SCAN?
  2. Does it make sense to execute a search request with scroll, and also
    with facet/aggregations?
  3. What is the difference between scrolling to the end of the results
    (i.e. calling until hits.length == 0) and issuing a specific
    ClearScrollRequest? It appears to me that the ClearScrollRequest
    immediately clears the scroll, whereas there is some time delay before a
    scroll is cleaned up after reaching the end of the results. (I can see
    this in my tests because the ElasticsearchIntegrationTest fails on teardown
    unless I perform an explicit ClearScrollRequest or add a delay of some
    number of seconds.) From reading the docs, I am not sure if this is a bug or
    expected behaviour.
  4. Does the scrollId represent the cursor, or the cursor page/iteration
    state? I have read documentation/mailing list explanations that say
    words to the effect of "you must pass the scrollId from the previous
    response into the subsequent request", which suggests the id represents
    some cursor state, i.e. performing a scroll request with a given scrollId
    will always return the same results. My observation, however, is that the
    scrollId does not change (i.e. I get back the same scrollId I passed in),
    so each scroll request with the same scrollId advances the 'cursor' until
    no results are returned. I have also read posts on the mailing list that
    implied multiple calls could be made in parallel with the same scrollId to
    load all the results faster (which would imply the scrollId is not
    expected to change). So which is correct? :)

To explain the background for my questions: I have two requirements:

  1. I get an update event that leads me to go find items in the index that
    need re-indexing. I perform a search on the index, I get the IDs and I
    load the original data from the source system(s) to reconstruct the
    document and index it. This seems to be exactly what SCAN and SCROLL are
    meant for. (However, the SCAN search type is different in that it always
    returns zero hits from the original search request - only the scroll
    requests seem to return hits.)

  2. The user normally performs a search, and naturally we limit how many
    results we serve to the client. However, occasionally, the user wants to
    return all the data for a given search/filter (say, to export to Excel or
    whatever), so it seems like a good idea to use scroll rather than
    paging through the results using from & size, as we know we will get
    consistent results even if documents are being added/removed/updated on the
    server.
    From a functionality perspective, I want to make sure the scrolling search
    request is the same as the non-scrolling search request so the user gets
    the same results - so from a code perspective, ideally I really want to
    make the codepath the same (save for adding the scroll keepAlive param).
    However, perhaps there are things I perform with my normal search (e.g.
    aggregations, SearchType.DEFAULT, etc) that just don't make sense when
    scrolling?

Many thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/80f173a7-07a0-4f72-a896-944223a3ac30%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

One more question I forgot:

Rather than looking at hits.length to know if the end of the scroll has
been reached, would it not be better to return a null scrollId when the end
of the cursor has been reached? On the surface it seems that would
a) be more intuitive
b) be the same regardless of which SearchType you are using
c) not be affected by the search itself returning zero results

Cheers.

(In reply to the original post by mooky, Tuesday, 17 June 2014 18:46:07 UTC+1.)

  1. yes

  2. facet/aggregations are not very useful while scrolling (I doubt they
    even work at all) because scrolling works on shard level and aggregations
    work on indices level

  3. a scroll request takes resources. The purpose of ClearScrollRequest is
    to release those resources explicitly. This is indeed a rare situation when
    you need explicit clearing. The time delay of releasing scrolls implicitly
    can be controlled by the requests.

  4. yes, the scroll id is an encoding of the combined state of all the
    shards that participate in the scroll. Even if the ID looks as if it has
    not changed, you should always use the latest reference to the scroll ID in
    the response, or you may clutter the nodes with unreleased scroll resources.
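
To make points 3 and 4 concrete, here is a minimal Python sketch of a scroll loop. It assumes a client exposing `search`, `scroll` and `clear_scroll` calls and a response shaped like the REST one (`_scroll_id`, `hits.hits`). `FakeScrollClient` is a hypothetical in-memory stand-in, not the Elasticsearch client, and it deliberately hands out a new id on every page so that reusing a stale id would fail visibly; a real cluster may return the same id, as observed earlier in the thread.

```python
import itertools


class FakeScrollClient:
    """Hypothetical in-memory stand-in for a scroll-capable search client."""

    def __init__(self, docs, page_size):
        self._docs = list(docs)
        self._page_size = page_size
        self._cursors = {}              # scroll id -> offset of the next page
        self._ids = itertools.count(1)

    def _page(self, offset):
        hits = self._docs[offset:offset + self._page_size]
        scroll_id = "scroll-%d" % next(self._ids)
        self._cursors[scroll_id] = offset + len(hits)
        return {"_scroll_id": scroll_id, "hits": {"hits": hits}}

    def search(self, scroll):
        return self._page(0)

    def scroll(self, scroll_id, scroll):
        # The old id is consumed; the response carries a fresh one.
        return self._page(self._cursors.pop(scroll_id))

    def clear_scroll(self, scroll_id):
        self._cursors.pop(scroll_id, None)

    def open_contexts(self):
        return len(self._cursors)


def scroll_all(client, keep_alive="1m"):
    """Collect every hit, always passing on the most recent scroll id."""
    response = client.search(scroll=keep_alive)
    scroll_id = response["_scroll_id"]
    collected = []
    try:
        while response["hits"]["hits"]:
            collected.extend(response["hits"]["hits"])
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response["_scroll_id"]  # never reuse a stale id
    finally:
        # Release the server-side resources right away instead of waiting
        # for the keep-alive to expire.
        client.clear_scroll(scroll_id=scroll_id)
    return collected


client = FakeScrollClient(docs=range(7), page_size=3)
all_hits = scroll_all(client)
```

Putting the clear call in a `finally` block is a design choice of this sketch: the context is released even if processing a page raises.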

Scrolling is very different from search, because there is a shard-level
machinery that iterates over the Lucene segments and keeps them open. This
tends to ramp up lots of server-side resources, which may be long-lived - a
challenge for resource management. There is a reaper thread that wakes up
from time to time to take care of stray scroll searches. You observed this
as a "time delay". Ordinary search actions never keep resources open at
shard level.
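
The "time delay" can be pictured with a toy model, sketched here in Python. `ToyScrollRegistry` and its integer timestamps are illustrative assumptions, not Elasticsearch internals: each request pushes a context's expiry forward by the keep-alive, an explicit clear frees it at once, and lapsed contexts are only freed when the reaper next runs.

```python
class ToyScrollRegistry:
    """Toy model of scroll contexts with a keep-alive and a periodic reaper."""

    def __init__(self):
        self.expiry = {}    # scroll id -> time at which the keep-alive lapses

    def touch(self, scroll_id, now, keep_alive):
        # Every scroll request pushes the expiry forward.
        self.expiry[scroll_id] = now + keep_alive

    def clear(self, scroll_id):
        # Explicit clear: the context is freed immediately.
        self.expiry.pop(scroll_id, None)

    def reap(self, now):
        # The reaper frees only contexts whose keep-alive has lapsed.
        for scroll_id in [s for s, t in self.expiry.items() if t <= now]:
            del self.expiry[scroll_id]


registry = ToyScrollRegistry()
registry.touch("a", now=0, keep_alive=5)
registry.touch("b", now=0, keep_alive=5)

registry.clear("a")                     # "a" is gone at once
after_clear = sorted(registry.expiry)

registry.reap(now=3)                    # keep-alive not lapsed: "b" still held
held_at_3 = sorted(registry.expiry)

registry.reap(now=6)                    # lapsed, so the reaper frees "b"
held_at_6 = sorted(registry.expiry)
```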

Using scroll search for creating large CSV exports is adequate because this
iterates through the result set doc by doc. But by replacing a full-fledged
search that has facets/filters/aggregations/sorting with a scroll search,
you will only create large overheads (if it is even possible).

A null scroll ID is a matter of API design. By using hit length check for
0, you can use the same condition for other queries, so it is convenient
and not confusing. Null scroll IDs are always prone to NPEs.

Jörg


Many thanks Jörg.

Further questions/comments inline:

> 1. yes

Thanks,

> 2. facet/aggregations are not very useful while scrolling (I doubt they
> even work at all) because scrolling works on shard level and aggregations
> work on indices level

If they are not expected to work, would it make sense to either:

  1. prevent aggregation/facet requests in conjunction with scroll
    requests (ie give an error to the user)
  2. Simply not execute them?

If it doesn't make sense, would it be better to not return any
aggregation/facet results at all?

> 3. a scroll request takes resources. The purpose of ClearScrollRequest is
> to release those resources explicitly. This is indeed a rare situation when
> you need explicit clearing. The time delay of releasing scrolls implicitly
> can be controlled by the requests.

Do you mean the keepAlive time? So, does the scroll (and its resources)
always remain for the duration of the keepAlive (since the last request on
that scroll) regardless of whether the end of the scroll was reached or not?

I read the following (from the documentation) to imply that reading to the
end of the scroll had the effect of "aborting" and therefore cleaning up
resources.

> Besides consuming the scroll search until no hits has been returned a
> scroll search can also be aborted by deleting the scroll_id

So, just to confirm, reading to the end of the results does nothing in
terms of bringing about the cleanup of the scroll? It's either the TTL or
the ClearScrollRequest that brings about the cleanup of resources.

Is there any downside to calling ClearScrollRequest explicitly?
(I am inclined to call it explicitly when the end of the scroll is reached
in order to clean up resources asap.)

> 4. yes, the scroll id is an encoding of the combined state of all the
> shards that participate in the scroll. Even if the ID looks as if it has
> not changed, you should always use the latest reference to the scroll ID in
> the response, or you may clutter the nodes with unreleased scroll resources.

Thanks for the explanation.

> A null scroll ID is a matter of API design. By using hit length check for
> 0, you can use the same condition for other queries, so it is convenient
> and not confusing. Null scroll IDs are always prone to NPEs.

Agreed. It's a matter of API style/design.
The only issue I have with checking hits.length is that, depending on the
SearchType, sometimes hits.length == 0 does not mean the end of the results
(e.g. SearchType.SCAN). It's the lack of consistency that bothers me about
it. It requires the code that handles results to be aware of a detail of
the request.

My case for using scrollId is that:
The scrollId is already null if no scroll is requested.
For this reason, (IMO) scrollId == null would be a more consistent indicator
of no scrolling required - or no further scrolling required. Also, it would
reinforce the notion that the user should always use/observe the returned
scrollId - they would have to.
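
A small Python sketch of the inconsistency described above: the loop that consumes pages has to be told whether the request was a scan, because only then is an empty first page not the end. `iter_pages` and the `fetch_next` callable are hypothetical names, and the response shape (`_scroll_id`, `hits.hits`) is assumed to mirror the REST response.

```python
def iter_pages(first_response, fetch_next, scan=False):
    """Yield non-empty pages of hits.

    `fetch_next(scroll_id)` is a hypothetical callable returning the next
    response. With `scan=True` the first response is assumed to carry no
    hits, so its emptiness is not treated as the end of the scroll.
    """
    response = first_response
    if scan:
        response = fetch_next(response["_scroll_id"])
    while response["hits"]["hits"]:
        yield response["hits"]["hits"]
        response = fetch_next(response["_scroll_id"])


def resp(hits):
    # Canned response for the demonstration below.
    return {"_scroll_id": "id", "hits": {"hits": hits}}


def make_fetcher(responses):
    remaining = iter(responses)
    return lambda scroll_id: next(remaining)


scan_pages = list(iter_pages(
    resp([]), make_fetcher([resp([1, 2]), resp([3]), resp([])]), scan=True))
plain_pages = list(iter_pages(
    resp([1, 2]), make_fetcher([resp([3]), resp([])])))
```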

Cheers,
-Nick

(In reply to Jörg Prante's message of Wednesday, 18 June 2014 00:04:06 UTC+1.)

Furthermore, on using hits.length == 0:
shard failure(s) can mean hits.length == 0 without the end of the scroll
having been reached.
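
One way to guard against this, sketched in Python under the assumption that each response carries a `_shards` section with a `failed` count (as the REST search response does). The helper name `end_of_scroll` and the `ShardFailure` exception are hypothetical, and raising on failure is just one possible policy.

```python
class ShardFailure(Exception):
    """Raised when a page cannot be trusted because shards failed."""


def end_of_scroll(response):
    """Return True only when an empty page can be taken as the end."""
    failed = response.get("_shards", {}).get("failed", 0)
    if failed:
        raise ShardFailure(
            "%d shard(s) failed; an empty page is not conclusive" % failed)
    return not response["hits"]["hits"]


more_to_come = {"_shards": {"failed": 0}, "hits": {"hits": [{"_id": "1"}]}}
finished = {"_shards": {"failed": 0}, "hits": {"hits": []}}
inconclusive = {"_shards": {"failed": 1}, "hits": {"hits": []}}
```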


Further to (2): would it be an improvement to have a different kind of
request for a scrolling search? That way the API could exclude items that
don't make sense (e.g. aggregations, facets, etc.).
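
A hypothetical sketch of that idea on the client side, in Python: one helper builds both kinds of request, and when a scroll keep-alive is given it leaves out the sections that don't make sense. The helper name and the returned (body, params) pair are assumptions for illustration; the `aggs`/`aggregations`/`facets` keys and the `scroll` parameter follow the REST search request.

```python
def build_search_request(body, scroll=None):
    """Return (body, params) for a search, sharing one code path.

    When `scroll` is set, aggregation and facet sections are dropped from a
    copy of the body and the keep-alive is added as a request parameter.
    """
    params = {}
    request_body = dict(body)       # shallow copy; the caller's dict is untouched
    if scroll is not None:
        for key in ("aggs", "aggregations", "facets"):
            request_body.pop(key, None)
        params["scroll"] = scroll
    return request_body, params


body = {
    "query": {"match_all": {}},
    "aggs": {"by_type": {"terms": {"field": "type"}}},
}
plain_body, plain_params = build_search_request(body)
scroll_body, scroll_params = build_search_request(body, scroll="1m")
```

Because the query itself is passed through unchanged, the scrolling export and the normal search match the same documents.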

On Wednesday, 18 June 2014 10:28:06 UTC+1, mooky wrote:

Many thanks Jörg.

Further questions/comments inline:

  1. yes

Thanks,

  1. facet/aggregations are not very useful while scrolling (I doubt they

even work at all) because scrolling works on shard level and aggregations
work on indices level

If they are not expected to work, would it make sense to either:

  1. prevent aggregation/facet requests in conjunction with scroll
    requests (ie give an error to the user)
  2. Simply not execute them?

If it doesn't make sense, would it be better to not return any
aggregation/facet results at all?

  1. a scroll request takes resources. The purpose of ClearScrollRequest is

to release those resources explicitly. This is indeed a rare situation when
you need explicit clearing. The time delay of releasing scrolls implicitly
can be controlled by the requests.

Do you mean the keepAlive time? So, does the scroll (and its resources)
always remain for the duration of the keepAlive (since the last request on
that scroll) regardless of whether the end of the scroll was reached or not?

I read the following (from the documentation) to imply that reading to the
end of the scroll had the effect of "aborting" and therefore cleaning up
resources.

Besides consuming the scroll search until no hits has been returned a
scroll search can also be aborted by deleting the scroll_id

So, just to confirm, reading to the end of the results does nothing in
terms of bringing about the cleanup of the scroll? Its either the TTL or
the ClearScrollRequest that brings about the cleanup of resources.

Is there any downside to calling ClearScrollRequest explicitly?
(I am inclined to call it explicitly when the end of the scroll is reached
in order clean up resources asap)

  1. yes, the scroll id is an encoding of the combined state of all the

shards that participate in the scroll. Even if the ID looks as if it has
not changed, you should always use the latest reference to the scroll ID in
the response, or you may clutter the nodes with unreleased scroll resources.

Thanks for the explanation.

A null scroll ID is a matter of API design. By using hit length check for

0, you can use the same condition for other queries, so it is convenient
and not confusing. Null scroll IDs are always prone to NPEs.

Agreed. Its a matter of API style/design.
The only issue I have with checking hits.length is that depending on the
SearchType, sometimes hits.length==0 does not mean the end of the results
(e.g. SearchType.SCAN). Its the lack of consistency that bothers me about
it. It requires the code that handles results to be aware of a detail of
the request.

My case for using scrollId is that:
The scrollId is already null if no scroll is requested.
For this reason, (IMO) scrollId==null would be a more consistent indicator
of no scrolling required - or no further scrolling required. Also it would
re-enforce the notion that the user should always use/observe the returned
scrollId - they would have to.

Cheers,
-Nick

On Wednesday, 18 June 2014 00:04:06 UTC+1, Jörg Prante wrote:

  1. yes

  2. facet/aggregations are not very useful while scrolling (I doubt they
    even work at all) because scrolling works on shard level and aggregations
    work on indices level

  3. a scroll request takes resources. The purpose of ClearScrollRequest is
    to release those resources explicitly. This is indeed a rare situation when
    you need explicit clearing. The time delay of releasing scrolls implicitly
    can be controlled by the requests.

  4. yes, the scroll id is an encoding of the combined state of all the
    shards that participate in the scroll. Even if the ID looks as if it has
    not changed, you should always use the latest reference to the scroll ID in
    the response, or you may clutter the nodes with unreleased scroll resources.

Scrolling is very different from search, because there is a shard-level
machinery that iterates over the Lucene segments and keep them open. This
tends to ramp up lots of server-side resources, which may long-lived - a
challenge for resource management. There is a reaper thread that wakes up
from time to time to take care of stray scroll searches. You observed this
as a "time delay". Ordinary search actions never keep resources open at
shard level.

Using scroll search for creating large CSV exports is adequate because
this iterates through the result set doc by doc. But by replacing a
full-fledged search that has facets/filters/aggregations/sorting with a
scroll search, you will only create large overhead (if it is even
possible).

A null scroll ID is a matter of API design. By checking for a hit length
of 0, you can use the same condition for other queries, so it is convenient
and not confusing. Null scroll IDs are always prone to NPEs.

Jörg

I'm interested in clarifying question number one. Why does it make sense to specify scroll without searchType=scan?

To preserve sort order or some other part of the search?