How much overhead for scroll search_type?

I want to page through (unsorted) search results in a way that provides
consistent results from one page to the next -- ideally even if there are
docs being indexed/deleted at the same time. I will have potentially
thousands of concurrent searches, but the paging for each individual search
will happen programmatically, so all page requests for the same search will
happen and finish within the period of a few seconds or less.

Using from/size parameters is not self-consistent during concurrent
indexing. I also wonder whether, even when there are no concurrent writes,
it's guaranteed to be self-consistent from one page to the next (when no
sorting is specified) ... this is not documented anywhere.
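
To make the from/size concern concrete, here is a toy illustration in plain Python. It is not Elasticsearch code: `read_pages` is a hypothetical helper, the "index" is just a list with a fixed order, and each page request re-reads the live list at an offset, which is the assumption behind the inconsistency described above.

```python
def read_pages(docs, size, concurrent_write=None):
    """Read all pages by offset, re-querying the live list for every page."""
    seen, start = [], 0
    while True:
        page = docs[start:start + size]  # each request sees the current state
        if not page:
            return seen
        seen.extend(page)
        start += size
        if concurrent_write is not None:
            concurrent_write(docs)   # another client writes between two requests
            concurrent_write = None  # apply the write only once


skipped = read_pages(["a", "b", "c", "d"], 2, lambda d: d.remove("a"))
duplicated = read_pages(["a", "b", "c", "d"], 2, lambda d: d.insert(0, "x"))
print(skipped)     # ['a', 'b', 'd']  ("c" is never returned)
print(duplicated)  # ['a', 'b', 'b', 'c', 'd']  ("b" is returned twice)
```

A delete ahead of the current offset shifts later documents forward so one is skipped; an insert shifts them back so one is repeated.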

search_type=scroll purports to do exactly what I need. I like that all
pages of results correspond to the same search timestamp and that results
are consistent without the overhead of sorting large result sets. Because
I'm searching programmatically, I can use scroll=5s.

However, the documentation (http://www.elasticsearch.org/guide/reference/api/search/scroll/) says I shouldn't use scrolling for "real time user requests"; I presume it's
storing some state on the data nodes within the expiry time. Can you
provide more insight into the reasons behind this restriction? How
significant is the overhead, in practice, of using "scroll" for real-time
queries -- up to a few thousand searches (scroll_ids) open at the same
time, with a quite small expiry?

Thanks!
Jeffrey

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

From what I read from the source code, the scroll search is just a
saved search with the help of a scroll id. The scroll id is used to
encode the node/shard request state to continue a previously executed
query. By doing this, you can execute searches as a sequence of equally
formulated search steps. It does not isolate your sequence of search
actions from other update actions the way a session would in a
transactional environment. So if you update docs with another client while
you step through a scroll search, the updates may or may not appear in your
results while you loop over the search result, depending on the ongoing
write/refresh operations across the nodes.
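
The request sequence described here (one search, then continuation calls that carry the scroll id) could be sketched roughly as follows. This is a minimal sketch, not a reference client: it assumes the REST endpoints `/{index}/_search?scroll=...` and `/_search/scroll?scroll=...&scroll_id=...` and the response fields `_scroll_id` and `hits.hits`; `fetch_json` and `scroll_all` are hypothetical helper names, and the host, index name and query in the usage note are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url, body=None):
    """Send the request (POST when a body is given, else GET) and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = Request(url, data=data, headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def scroll_all(base_url, index, query, keep_alive="5s", page_size=100, fetch=fetch_json):
    """Yield every hit of a scrolled search, one page at a time."""
    # First request: run the search and ask the server to keep the scroll open.
    params = urlencode({"scroll": keep_alive, "size": page_size})
    page = fetch(f"{base_url}/{index}/_search?{params}", query)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Continuation: send back the most recent scroll id and renew the keep-alive.
        params = urlencode({"scroll": keep_alive, "scroll_id": page["_scroll_id"]})
        page = fetch(f"{base_url}/_search/scroll?{params}")
```

Against a running node this might be called as `for hit in scroll_all("http://localhost:9200", "myindex", {"query": {"match_all": {}}}): ...`. The `fetch` parameter exists only so the loop can be exercised with a stub instead of a live cluster.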

My understanding of the remark about "real time user requests" is that
with scroll search you cannot rely on the Lucene "near realtime"
feature, which ensures that a document is immediately visible in the GET
API after it has been created, unaffected by refresh operations.

The scroll id is very compact. There is a slight overhead in managing
scroll ids on the heap and in encoding/decoding them, but that is
minimal. If the scroll id's lifetime has expired, you will get an error
in the search API, and the scroll search resources will get garbage
collected.

Jörg

(In reply to Jeffrey Gerard's message of 09.04.13, 19:39.)

Great answer -- thanks for the clarification on this!

(In reply to Jörg Prante's message of Tuesday, April 9, 2013, 3:46:06 PM UTC-7.)

There's a thread from 2011 (https://groups.google.com/d/msg/elasticsearch/Cord2_BqO2s/x4500A8INHsJ) in which Shay says "Scan search type is a point in time search, when its executed.
You won't see changes (either deletions or new docs) after its first
execution." and there's a "guarantee you won't see duplicates or changed
data".

On the other hand, this is not actually in the ES documentation. Has this
behavior changed since then to no longer be transactional?

I had similar questions and performed a few rough local tests with a small
set of data (<150 docs) this evening. What I saw aligned with what kimchy
stated in the 2011 thread Jeffrey quoted.

I didn't look at the source so can't guarantee anything about
Elasticsearch but the observations may be useful:

Obtained a scroll id for query where type was not added, then
created/added documents to that type before subsequent requests: Yielded
zero results.

Added documents to type then obtained a scroll id and performed
subsequent requests: Yielded appropriate number of documents.

Obtained a scroll id, deleted entire type and performed subsequent
requests: Requests performed after deletion yielded no results.

Obtained a scroll id, then added new documents that matched the query,
during subsequent requests: Did not yield newly added documents (i.e.:
documents from initial query were preserved).

Obtained a scroll id, then deleted documents that matched query, during
subsequent requests: Deleted documents were still returned in the result
set (i.e.: documents from initial query were preserved).

Obtained a scroll id, then modified documents, during subsequent requests:
Document remained unchanged (i.e.: documents from initial query were
preserved).
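
The add/delete/modify observations above can be mimicked with a toy point-in-time model. This is only an in-memory sketch of the observed behavior and says nothing about how Elasticsearch implements it: `ToyIndex` and `open_scroll` are hypothetical names, and the snapshot is simply a deep copy taken when the scroll is opened.

```python
import copy


class ToyIndex:
    """In-memory stand-in for an index; NOT Elasticsearch."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, source):
        self.docs[doc_id] = source

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def open_scroll(self, predicate, size):
        # Freeze the matching documents now, at first execution.
        frozen = [(doc_id, copy.deepcopy(src))
                  for doc_id, src in sorted(self.docs.items()) if predicate(src)]
        # Later pages are served from the frozen copy, never from self.docs.
        return iter([frozen[i:i + size] for i in range(0, len(frozen), size)])


index = ToyIndex()
for n in (1, 2, 3):
    index.put(str(n), {"tag": "x", "v": n})

scroll = index.open_scroll(lambda src: src["tag"] == "x", size=2)
first = next(scroll)

index.put("4", {"tag": "x", "v": 4})  # added after the scroll was opened
index.delete("3")                     # deleted after the scroll was opened
index.put("2", {"tag": "x", "v": 99})  # modified after the scroll was opened

second = next(scroll)
print(second)  # [('3', {'tag': 'x', 'v': 3})]
```

The second page still contains the deleted document "3" with its original content and does not contain the new document "4", matching the three observations about added, deleted and modified documents.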

- oli

(In reply to Jeffrey Gerard's message of Wednesday, April 10, 2013, 12:38:35 PM UTC-7.)

Yes, what I stated back then still holds: it's a point-in-time search, defined by the first execution.

(In reply to the message from Oli (oli@climate.com) of Thu, Jun 6, 2013, 7:37 AM.)

Thanks for following up on this, I appreciate the confirmation.

Just on the original question, I'm interested in using scrolling for "real
time user requests", which the documentation dissuades me from doing. Are
there significant reasons not to do that?

I could see that it may be due to missing documents that are being indexed
after obtaining the scroll, but this isn't really an issue for me. Are
there performance implications or restrictions that might not be obvious?

Thanks very much,

- oli

(In reply to the message from Shay Banon (kimchy@gmail.com) of Fri, Jun 7, 2013, 3:51 AM.)