Hi Alex,
I've recently been evaluating a very similar use case; you might be
interested in these threads:
- unique scroll ids: https://groups.google.com/forum/#!topic/elasticsearch/tqmwhuDADFw
- characteristics of scrolling: https://groups.google.com/forum/#!topic/elasticsearch/55ByVEr93bo
I can't speak to different versions of ES or codebase specifics, but my
guess is that what you're seeing is similar to what I see in 0.20.4:
- A _scroll_id will remain the same for some number of requests issued
against it, and at some point will change.
- Issuing multiple requests against the same _scroll_id will yield
subsequent results (at least for the overall resultset that that _scroll_id
represents before it changes), not the same results.
- You must use the most recent _scroll_id (as you mentioned) to be sure
to accurately get the full set of results for the query.
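To make the behaviour above concrete, here is a minimal Python sketch of a single consumer draining a scroll while always carrying forward the most recent _scroll_id. Note that fetch_page is a hypothetical callable I made up for illustration (it is not part of any ES client): it takes a scroll id and returns the next scroll id plus a batch of hits. Treat it as a sketch of the pattern under those assumptions, not of how ES implements scrolling.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical signature: scroll_id -> (next_scroll_id, hits).
# In practice this would wrap whatever scroll request your client issues.
FetchPage = Callable[[str], Tuple[str, List[str]]]


def drain_scroll(first_scroll_id: str, fetch_page: FetchPage) -> Iterator[str]:
    """Yield every hit, always reusing the id returned by the last response."""
    scroll_id = first_scroll_id
    while True:
        next_id, hits = fetch_page(scroll_id)
        if not hits:
            # An empty batch is taken to mean the result set is exhausted.
            break
        yield from hits
        # The id may or may not have changed; either way use the newest one.
        scroll_id = next_id
```

Because every request depends on the id returned by the previous one, the loop is inherently sequential, which is why I don't see how to hand the same id to several independent mappers.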
If you're not seeing the scroll id change in 0.19, I would guess that your
query is small enough that it doesn't end up giving you a new id, but
that's just a hunch.
To conclude, I don't *think* you'll be able to achieve what you want using
scrollers (at least that's the conclusion I've come to - I'd love to hear
otherwise). An alternative approach is to use the from/size parameters of a
normal query and have each mapper operate on its own range of the results.
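As a rough illustration of the range-per-mapper idea, here is a small Python sketch that splits a known hit count into contiguous, non-overlapping windows. The function name partition_ranges and its arguments are my own invention; the only ES-specific assumption is that each mapper would pass its window as the from and size values of an ordinary search request. It also assumes every mapper runs the same query with the same sort order and that the index does not change between requests, otherwise the windows may not line up.

```python
from typing import Dict, List


def partition_ranges(total_hits: int, num_mappers: int) -> List[Dict[str, int]]:
    """Split total_hits into up to num_mappers contiguous from/size windows."""
    if num_mappers <= 0:
        raise ValueError("num_mappers must be positive")
    if total_hits < 0:
        raise ValueError("total_hits must not be negative")

    base, extra = divmod(total_hits, num_mappers)
    ranges = []
    start = 0
    for i in range(num_mappers):
        # The first `extra` mappers take one additional hit each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than hits: leave this one without work
        ranges.append({"from": start, "size": size})
        start += size
    return ranges
```

For example, partition_ranges(10, 3) gives windows starting at 0, 4 and 7 with sizes 4, 3 and 3. Each mapper could also page through its own window in smaller batches rather than fetching it in one request.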
On Mon, Jun 17, 2013 at 9:28 AM, Alex at Ikanow apiggott@ikanow.com wrote:
I've been thinking about ways of running Hadoop jobs on the results of
elasticsearch queries.
I was hoping that I could just distribute a scrollid to the different
mappers and then have them read and process batches of data served as
needed from their closest es node (data locality isn't really an issue
here, the hadoop and es nodes will normally not be co-located)
The documentation is pretty clear that this shouldn't work: "Note:
the scroll_id changes for each scroll request and only the most recent
one should be used". But running on 0.19, the scroll id did appear to
remain the same. Does anybody know which of the following is true:
- On all versions the scrollid can change, though sometimes won't
- It didn't change up until version XXX, but now does
- It doesn't change, but we put that in because we might need to make it
change in the future?
(It would also require the get-more-data operation to be atomic across
multiple nodes, which may not be the case)
Any pointers to good spots in the codebase to look at the inner workings
of scrolls? (I always get a bit lost in the middle of the es code!)
Thanks for any insight/pointers
Alex
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.