Just Pushed: Search Scan Type for efficient large hit set scanning


(Shay Banon) #1

Heya,

Just pushed support for "scan" search type in order to efficiently scan/iterate over a large hit set. Issue here: https://github.com/elasticsearch/elasticsearch/issues/707.

The idea is to "start" the scan, which returns the number of docs we are going to scan over and a scroll id. Then start the scrolling process, passing the scroll id from the previous response to the next request. Iteration is complete when no hits come back.
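
To make the flow above concrete, here is a minimal sketch of the loop in Java. It deliberately runs against a hypothetical in-memory client instead of the real Elasticsearch client, so Page, ScanClient, FakeClient and scanAll are illustrative names only and not part of any Elasticsearch API. It assumes the initial "start" response carries the total and a scroll id but no hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ScanLoopSketch {

    /** One response: total doc count, the hits in this batch, and the scroll id for the next request. */
    static final class Page {
        final long totalHits;
        final List<String> hits;
        final String scrollId;

        Page(long totalHits, List<String> hits, String scrollId) {
            this.totalHits = totalHits;
            this.hits = hits;
            this.scrollId = scrollId;
        }
    }

    /** Hypothetical client interface (an assumption for this sketch, NOT the Elasticsearch API). */
    interface ScanClient {
        Page startScan();              // returns the total and a scroll id, no hits
        Page scroll(String scrollId);  // returns the next batch and a new scroll id
    }

    /** In-memory fake: the scroll id is simply the offset of the next unread doc. */
    static final class FakeClient implements ScanClient {
        private final List<String> docs;
        private final int batchSize;

        FakeClient(List<String> docs, int batchSize) {
            this.docs = docs;
            this.batchSize = batchSize;
        }

        @Override
        public Page startScan() {
            return new Page(docs.size(), Collections.<String>emptyList(), "0");
        }

        @Override
        public Page scroll(String scrollId) {
            int from = Integer.parseInt(scrollId);
            int to = Math.min(from + batchSize, docs.size());
            return new Page(docs.size(), new ArrayList<String>(docs.subList(from, to)), String.valueOf(to));
        }
    }

    /** Start the scan, then keep scrolling with the previous response's scroll id until no hits come back. */
    static List<String> scanAll(ScanClient client) {
        Page page = client.startScan();
        List<String> collected = new ArrayList<String>();
        String scrollId = page.scrollId;
        while (true) {
            page = client.scroll(scrollId);
            if (page.hits.isEmpty()) {
                break; // an empty batch means the iteration is complete
            }
            collected.addAll(page.hits);
            scrollId = page.scrollId; // always pass the latest scroll id forward
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(scanAll(new FakeClient(docs, 2))); // prints [a, b, c, d, e]
    }
}
```

The point of the sketch is the shape of the loop: the scroll id is taken from each response rather than reused from the first one, and the empty batch is the only termination signal.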

-shay.banon


(Paul Smith) #2

Very excited to take this out on the open road and test this in a large
index, thanks for your efforts Shay!

On 22 February 2011 09:14, Shay Banon shay.banon@elasticsearch.com wrote:

Heya,

Just pushed support for "scan" search type in order to efficiently
scan/iterate over a large hit set. Issue here:
https://github.com/elasticsearch/elasticsearch/issues/707.

The idea of using it is "start" the scanning, and getting back the
number of docs we are going to scan over, and a scroll id. And then, start
the scrolling processing, passing the previous response scroll id to the
next request. Iteration is complete when no hits are back.

-shay.banon


(Shay Banon) #3

Great! As always, feedback on the API / usage is greatly appreciated (and this one was tricky :) ).
On Tuesday, February 22, 2011 at 12:35 AM, Paul Smith wrote:

Very excited to take this out on the open road and test this in a large index, thanks for your efforts Shay!


(Barsk) #4

Great! I can see this coming in handy.
Any hints on the threshold at which this technique starts giving better
results than the traditional approach? I.e., is it worth using when the
number of hits is > 100, or > 100,000? Or is it best to always use it if you
have your own "paging" handling of the results in your app, i.e., showing
20 hits at a time and scrolling through them?

Just wondering whether there are any tradeoffs one should consider.

/Kristian


(Karussell) #5

Kristian,

If you have unlimited memory, you can always use the traditional
approach.
I had only a few GB of RAM, and for me the traditional approach hit its
limit somewhere above 300,000 documents.

Important: "Note, scan search type does not support sorting (either on
score or a field) or faceting."

Regards,
Peter.

On 22 Feb., 08:49, Kristian Jörg k...@devo.se wrote:

Great! I can see this coming in handy.
Any hints of the threshold where using this technique is getting better
results than the traditional? I.e is it worth using when the number of
hits > 100 or 100.000. Or is it best to use always if you have your own
"paging" handling of the results in your app. I.e showing 20 hits at the
time and scrolling that?

Just looking for if there are any tradeoffs one should consider.

/Kristian


(Karussell) #6

Hi Shay,

I'm getting an error (stack trace below) on:

rsp = client.prepareSearchScroll(scrollId)
        .setScroll(TimeValue.timeValueMinutes(30)).execute().actionGet();

This happens only for the last search (?), although I break the while
loop when rsp.hits().hits().length == 0.

Regards,
Peter.

org.elasticsearch.action.search.ReduceSearchPhaseException: Failed to execute phase [fetch], [reduce] ; shardFailures
{SearchContextMissingException[No search context found for id [1151]]}
{SearchContextMissingException[No search context found for id [1155]]}
{SearchContextMissingException[No search context found for id [1154]]}
{SearchContextMissingException[No search context found for id [1152]]}
{SearchContextMissingException[No search context found for id [1153]]}
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction.finishHim(TransportSearchScrollScanAction.java:209)
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction.access$1300(TransportSearchScrollScanAction.java:80)
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction$3.onFailure(TransportSearchScrollScanAction.java:199)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:378)
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction.executePhase(TransportSearchScrollScanAction.java:184)
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction.access$700(TransportSearchScrollScanAction.java:80)
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction$2.run(TransportSearchScrollScanAction.java:157)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IndexOutOfBoundsException: index (0) must be less than size (0)
    at org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions.java:301)
    at org.elasticsearch.common.base.Preconditions.checkElementIndex(Preconditions.java:280)
    at org.elasticsearch.common.collect.Iterables.get(Iterables.java:639)
    at org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:259)
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction.innerFinishHim(TransportSearchScrollScanAction.java:226)
    at org.elasticsearch.action.search.type.TransportSearchScrollScanAction$AsyncAction.finishHim(TransportSearchScrollScanAction.java:207)
    ... 9 more


(Shay Banon) #7

Hi Kristian,

This is not meant for executing typical search requests, since it does no sorting (and, for example, facets are not really meaningful here). It is meant more for things like reindexing part or all of an index.

-shay.banon
On Tuesday, February 22, 2011 at 9:49 AM, Kristian Jörg wrote:

Great! I can see this coming in handy.
Any hints of the threshold where using this technique is getting better
results than the traditional? I.e is it worth using when the number of
hits > 100 or 100.000. Or is it best to use always if you have your own
"paging" handling of the results in your app. I.e showing 20 hits at the
time and scrolling that?

Just looking for if there are any tradeoffs one should consider.

/Kristian


(Shay Banon) #8

Heya,

Is there a chance that you can recreate this in a testcase? Check SearchScanTests for simple tests for scanning.

-shay.banon
On Tuesday, February 22, 2011 at 4:14 PM, Karussell wrote:

Hi Shay,

I'm getting an error ** on:

rsp = client.prepareSearchScroll(scrollId)
        .setScroll(TimeValue.timeValueMinutes(30)).execute().actionGet();

This happens only for the last search (?) although I break the while
loop when rsp.hits().hits().length ==0

Regards,
Peter.


(Paul Smith) #9

Heya, do you have a WetFinger(tm) wild-ass guesstimate/target for the release
of 0.16? (Obviously 0.15 was released just last week, so I'm not expecting
it this week... :) ) Next month-ish? April? Purely for planning purposes
here for me.


(Shay Banon) #10

I aim to get onto a ~1 month release cycle. It would be great if you could test it before then and point out any problems, of course :)
On Wednesday, February 23, 2011 at 12:05 AM, Paul Smith wrote:

Heya, do you have a WetFinger(tm) wild-ass-guessitimate/target for release of 0.16? (obviously 0.15 was release just last week, so I'm not expecting it this week.. :slight_smile: ) Next month-ish? April? purely for planning purposes here for me.


(Karussell) #11

I couldn't reproduce the exception, but there is a problem/bug:

https://gist.github.com/840604

Although the bulkUpdate in between seems to be unrelated, without
it all is fine!
Maybe once this is solved, the exception will go away :) !?

Regards,
Peter.

On 22 Feb., 20:05, Shay Banon shay.ba...@elasticsearch.com wrote:

Heya,

Is there a chance that you can recreate this in a testcase? Check SearchScanTests for simple tests for scanning.

-shay.banon


(Shay Banon) #12

Your test is wrong, since you add the ids from the unrelatedindex as well as from test2 to expectedIds2, but then only search test2 (in the second test)...
On Wednesday, February 23, 2011 at 6:07 PM, Karussell wrote:

I couldn't reproduce the exception but there is a problem/bug:

https://gist.github.com/840604

Although the bulkUpdate in-between seems to be unrelated ... without
it all is fine!
maybe when this is solved then the exception goes away :slight_smile: !?

Regards,
Peter.


(Karussell) #13

Oops ...

On 24 Feb., 05:12, Shay Banon shay.ba...@elasticsearch.com wrote:

Your tests is wrong, since you add to the expectedIds2 the ids from the unrelatedindex as well as test2, but then only search for test2 (in the second test)...


(Barsk) #14

Ok!

Thanks for clarifying!

Regards
Kristian

Shay Banon wrote on 2011-02-22 18:52:

Hi Kristian,

This is not meant for executing typical search requests, since it
does no sorting (and for example, facets are not really meaningful
here). It is more meant for things like reindexing part / all of an index.

-shay.banon

