Distributing query results (using scrolling?)

I've been thinking about ways of running Hadoop jobs on the results of
elasticsearch queries.

I was hoping that I could just distribute a scrollid to the different
mappers and then have them read and process batches of data served as
needed from their closest es node (data locality isn't really an issue
here, the hadoop and es nodes will normally not be co-located)

The documentation is pretty clear that this shouldn't work (
http://www.elasticsearch.org/guide/reference/api/search/scroll/) "Note:
the scroll_id changes for each scroll request and only the most recent one
should be used" ... but running on 0.19, the scroll id did appear to remain
the same .... does anybody know which of the following is true:

  • On all versions the scrollid can change, though sometimes won't
  • It didn't change up until version XXX, but now does
  • It doesn't change, but we put that in because we might need to make it
    change in the future?

(It would also require the get-more-data operation to be atomic across
multiple nodes, which may not be the case)

Any pointers to good spots in the codebase to look at the inner workings of
scrolls? (I always get a bit lost in the middle of the es code!)

Thanks for any insight/pointers

Alex

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Alex,

I've recently been evaluating a very similar use case, so you might be
interested in these threads:

  • unique scroll ids: https://groups.google.com/forum/#!topic/elasticsearch/tqmwhuDADFw
  • characteristics of scrolling: https://groups.google.com/forum/#!topic/elasticsearch/55ByVEr93bo

I can't speak to different versions of ES or codebase specifics, but my
guess is that what you're seeing is similar to what I see in 0.20.4:

  • A _scroll_id will remain the same for some number of requests issued
    against it, and at some point will change.
  • Issuing multiple requests against the same _scroll_id will yield
    subsequent results (at least for the overall resultset that that _scroll_id
    represents before it changes), not the same results.
  • You must use the most recent _scroll_id (as you mentioned) to be sure
    to accurately get the full set of results for the query.
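
To make the third point concrete, here is a minimal sketch of a scroll loop that never reuses an old id. It is not tied to any client library: fetch_next is a hypothetical callable standing in for whatever issues the scroll request, and the page shape (a dict with "_scroll_id" and "hits") is an assumption made for illustration.

```python
from typing import Any, Callable, Dict, Iterator

# Assumed page shape for this sketch: {"_scroll_id": str, "hits": list}
Page = Dict[str, Any]


def drain_scroll(first_page: Page, fetch_next: Callable[[str], Page]) -> Iterator[Any]:
    """Yield every hit, always passing on the most recent _scroll_id."""
    page = first_page
    while page["hits"]:
        for hit in page["hits"]:
            yield hit
        # Use the id from the page just received, never one cached earlier,
        # since the id may or may not have changed between requests.
        page = fetch_next(page["_scroll_id"])
```

Because the loop carries state from one response to the next, it is inherently sequential, which is why a single scroll is awkward to share between several mappers.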

If you're not seeing the scroll id change in 0.19, I would guess that your
query is small enough that it doesn't end up giving you a new id, but
that's just a hunch.

To conclude, I don't *think* you'll be able to achieve what you want using
scrollers (at least that's the conclusion I've come to - I'd love to hear
otherwise). An alternative approach is to use the from/size attributes of a
normal query and have each mapper operate on a certain range of the results.
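
A rough sketch of that range-splitting idea, assuming the total hit count is known up front (for example from an initial count request). The function name split_ranges is made up for this example; "from" and "size" are the paging keys a normal search request accepts.

```python
from typing import Dict, List


def split_ranges(total_hits: int, num_mappers: int) -> List[Dict[str, int]]:
    """Divide total_hits into contiguous from/size windows, one per mapper."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    base, extra = divmod(total_hits, num_mappers)
    ranges = []
    start = 0
    for i in range(num_mappers):
        # The first `extra` mappers take one additional hit each.
        size = base + (1 if i < extra else 0)
        if size:
            ranges.append({"from": start, "size": size})
        start += size
    return ranges
```

For example, split_ranges(10, 3) gives windows starting at 0, 4 and 7. This only yields a consistent partition if every mapper runs the same query with a stable sort and the index does not change in between; otherwise windows can overlap or miss documents.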

- oli

Hey, thanks for sharing your experiences and those links.

If there's 2 of us then maybe that's enough for a feature request :)

Hmm looking at the code, it seems like all the scroll function does is save
and apply the from/to information anyway (
https://github.com/elasticsearch/elasticsearch/blob/f09ad507a44367ed6ca29c6f3dae7659a2da1994/src/main/java/org/elasticsearch/search/SearchService.java),
so no reason not to do as you suggest!

(The existing elasticsearch-hadoop connector incidentally generates 1
mapper per shard and scrolls through that, which might be as well as you
can do anyway)
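
For illustration only, one way to describe "one mapper per shard" as a list of independent work items. This is not taken from the elasticsearch-hadoop source: the function name and the split layout are hypothetical, and it assumes each mapper pins its own scroll to a single shard with the search preference parameter (preference=_shards:N) and a scroll keep-alive.

```python
from typing import Any, Dict, List


def per_shard_splits(index: str, num_shards: int, keep_alive: str = "5m") -> List[Dict[str, Any]]:
    """Build one split descriptor per shard; each mapper scrolls only its own shard."""
    return [
        {
            "index": index,
            "shard": shard,
            # Query-string parameters the mapper would send with its initial search.
            "params": {"preference": f"_shards:{shard}", "scroll": keep_alive},
        }
        for shard in range(num_shards)
    ]
```

Each mapper then owns its scroll id outright, so nothing has to be shared or kept atomic across mappers; the cost is that parallelism is capped at the number of shards.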

(Clarification: I'm guessing it caches a block of matching _ids then
scrolls through that using the from/to construct until you reach the end
and then it does another query, which is when the scrollid changes, or
something like that. So the problem with multiple to/from queries is that
you'll redo the same query over and over, although some subset of it will
be cached so will run more quickly subsequent times)
