I had similar questions and performed a few rough local tests with a small
set of data (<150 docs) this evening. What I saw aligned with what kimchy
stated in the 2011 thread Jeffrey quoted.
I didn't look at the source so can't guarantee anything about
Elasticsearch but the observations may be useful:
- Obtained a scroll id for a query on a type that had not been added yet,
  then created/added documents to that type before subsequent requests:
  yielded zero results.
- Added documents to the type, then obtained a scroll id and performed
  subsequent requests: yielded the appropriate number of documents.
- Obtained a scroll id, deleted the entire type, and performed subsequent
  requests: requests performed after the deletion yielded no results.
- Obtained a scroll id, then added new documents that matched the query
  during subsequent requests: did not yield the newly added documents
  (i.e., documents from the initial query were preserved).
- Obtained a scroll id, then deleted documents that matched the query during
  subsequent requests: deleted documents were still returned in the result
  set (i.e., documents from the initial query were preserved).
- Obtained a scroll id, then modified documents during subsequent requests:
  documents remained unchanged (i.e., documents from the initial query were
  preserved).
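
A rough sketch of how checks like the ones above could be scripted, in Python. This is not the code used for the tests described here. The names http_send, scroll_ids and the between_pages hook are made up for illustration, the node address localhost:9200 is an assumption, and the request shapes follow the REST scroll API as I understand it, so verify them against the documentation for your Elasticsearch version.

```python
import json
import urllib.parse
import urllib.request


def http_send(method, path, params=None, body=None, base="http://localhost:9200"):
    """Assumed transport: send one request to a local node, return parsed JSON."""
    url = base + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    data = body.encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scroll_ids(send, index, query, keep_alive="5s", size=50, between_pages=None):
    """Run one scroll to completion and return the ids of all hits, in order.

    `send` is any callable with the same signature as http_send.
    `between_pages` (hypothetical hook) is called before each follow-up
    request, e.g. to add, delete or modify documents mid-scroll.
    """
    page = send("GET", "/%s/_search" % index,
                {"scroll": keep_alive, "size": size}, json.dumps(query))
    # The first response may already carry hits, or none (scan-style searches).
    seen = [hit["_id"] for hit in page["hits"]["hits"]]
    scroll_id = page["_scroll_id"]
    while True:
        if between_pages is not None:
            between_pages()
        page = send("GET", "/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": scroll_id})
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty page marks the end of the scroll
        seen.extend(hit["_id"] for hit in hits)
        # Keep using the most recent scroll id if the response supplies one.
        scroll_id = page.get("_scroll_id", scroll_id)
    return seen
```

To compare against a live node, call scroll_ids(http_send, "myindex", {"query": {"match_all": {}}}, between_pages=some_mutation), where "myindex" and some_mutation are placeholders, and compare the returned ids with what was indexed before the scroll started.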
On Wednesday, April 10, 2013 12:38:35 PM UTC-7, Jeffrey Gerard wrote:
There's a thread from 2011
(https://groups.google.com/d/msg/elasticsearch/Cord2_BqO2s/x4500A8INHsJ) in
which Shay says "Scan search type is a point in time search, when its executed.
You won't see changes (either deletions or new docs) after its first
execution." and there's a "guarantee you won't see duplicates or changed
data".
On the other hand, this is not actually in the ES documentation. Has this
behavior changed since then to no longer be transactional?
On Tuesday, April 9, 2013 3:46:06 PM UTC-7, Jörg Prante wrote:
From what I read from the source code, the scroll search is just a
saved search with the help of a scroll id. The scroll id is used to
encode the node/shard request state to continue a previously executed
query. By doing this, you can execute searches as a sequence of equally
formulated search steps. It does not isolate your sequence of search actions
from other update actions the way a session would in a transactional
environment. So if you update docs with another client while you step
through a scroll search, the updates may or may not appear in your
results while you loop over the search result, depending on the ongoing
write/refresh operations across the nodes.
My understanding of the remark about "real time user requests" is that
with scroll search you cannot rely on the Lucene "near realtime"
feature, which ensures you can immediately see a document in the GET API
after it has been created, unaffected by the refresh operations.
The scroll id is very compact. There is a slight overhead in managing
scroll ids on the heap and in encoding/decoding them, but it is
minimal. If the scroll id's lifetime is exceeded, you will get an error
from the search API, and the scroll search resources will be garbage
collected.
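
One way a client could cope with an expired scroll id is to throw away the partial results and start the whole scroll again, sketched below in Python. How the expiry error actually surfaces (status code, exception type) depends on the client and the Elasticsearch version, so this sketch leaves detection to the caller: ScrollExpired and collect_with_restart are hypothetical names, not part of any Elasticsearch client.

```python
class ScrollExpired(Exception):
    """Hypothetical marker the caller raises when a scroll id has expired."""


def collect_with_restart(run_scroll, max_attempts=3):
    """Call run_scroll() until it finishes without the scroll expiring.

    run_scroll must start a fresh scroll each time and return the complete
    result; partial results from an expired scroll are discarded so that
    the returned pages all come from a single scroll.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return run_scroll()
        except ScrollExpired as exc:
            last_error = exc  # remember the failure and try a fresh scroll
    raise last_error
```

Restarting means the second attempt sees a newer view of the index than the first, which is why nothing from the failed attempt is kept.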
Jörg
On 09.04.13 19:39, Jeffrey Gerard wrote:
I want to page through (unsorted) search results in a way that
provides consistent results from one page to the next -- ideally even
if there are docs being indexed/deleted at the same time. I will have
potentially thousands of concurrent searches, but the paging for each
individual search will happen programmatically, so all page requests
for the same search will happen and finish within the period of a few
seconds or less.
Using from/size parameters is not self-consistent during concurrent
indexing. I also wonder whether, even when there are no concurrent writes,
it's guaranteed to be self-consistent from one page to the next (when
no sorting is specified) ... this claim is not documented anywhere.
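
To illustrate the from/size concern, here is a toy in-memory model in Python. It is not Elasticsearch: it assumes each page re-runs the query against the current data and slices the result by offset, and that the concurrent change lands ahead of the current offset. The function name paged_read and the document ids are invented for the example.

```python
def paged_read(docs, size, mutate):
    """Read `docs` page by page using offsets, mutating after the first page.

    Each page is a fresh slice of the *current* list, which is the
    assumption this model makes about from/size paging.
    """
    seen = []
    start = 0
    first_page = True
    while True:
        chunk = docs[start:start + size]
        if not chunk:
            break
        seen.extend(chunk)
        start += size
        if first_page:
            mutate(docs)  # simulate a concurrent write between two pages
            first_page = False
    return seen


# A document inserted ahead of the offset shifts everything right,
# so "d2" shows up on two consecutive pages.
with_insert = paged_read(["d1", "d2", "d3", "d4"], 2,
                         lambda docs: docs.insert(0, "d0"))
print(with_insert)  # ['d1', 'd2', 'd2', 'd3', 'd4']

# A document deleted ahead of the offset shifts everything left,
# so "d3" is never returned.
with_delete = paged_read(["d1", "d2", "d3", "d4"], 2,
                         lambda docs: docs.remove("d1"))
print(with_delete)  # ['d1', 'd2', 'd4']
```

The duplicate and the gap both come from the offset being applied to a list that changed between requests, which is the kind of inconsistency a point-in-time scroll is meant to avoid.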
search_type=scroll purports to do exactly what I need. I like that
all pages of results correspond to the same search timestamp and that
results are consistent without the overhead of sorting large result
sets. Because I'm searching programmatically, I can use scroll=5s.
However, the documentation
(http://www.elasticsearch.org/guide/reference/api/search/scroll/) says I
shouldn't use scrolling for "real time user requests"; I presume it's
storing some state on the data nodes within the expiry time. Can you
provide more insight into the reasons behind this restriction? How
significant is the overhead, in practice, of using "scroll" for
real-time queries -- up to a few thousand searches (scroll_ids) open
at the same time, with a quite small expiry?
Thanks!
Jeffrey
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.