Parallel scans to speed up scrolling


(rockbobsta) #1

Hi,

apologies if this is a double post, I also added this question in response
to another question but I'm posting it here as I think it's actually a new
topic:
https://groups.google.com/forum/?fromgroups#!starred/elasticsearch/l-YJGJAGJLQ

I'm just wondering if the parallel scan feature mentioned in that document
is under any development?
We have an index with 20 shards and approx 125 million docs, using default
routing.
I have a situation where we run quite a lot of large scans to process
customer data, and we're looking for a way to speed up the sequential
scans. The parallel scan would be great in this situation.

Another idea was to implement this by running multiple workers (one per
shard) and enforcing that each worker only hits one shard. Each worker
would run the same scan against each shard, so we should get the same
results but faster as they run in parallel (and hopefully as each query is
only hitting one shard, this may be faster too?)
If there was a way I could determine a routing value for each worker that
would force each worker to the correct shard, I think it would be possible,
just by including routing=[some_key] with the query.
But it would mean working out some inverse of the default routing algorithm
to find a routing value that goes to each shard, and I'm not sure how
simple this is.

Does this sound like a feasible solution? Or does anyone have any
suggestions as to other ways to achieve this?
If I could get some pointers on how to implement I will have a look at
implementing something.

thanks in advance for any assistance.
Bob.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Martijn Van Groningen) #2

Hi,

A search request (also a scan request) is already executed in parallel. By
default each shard level request is executed in parallel.
How long is a single scan request taking? What did you specify in the
size request option?

It is possible to control to what specific shards a search request goes to
via the preference option:
http://www.elasticsearch.org/guide/reference/api/search/preference/

I think in your case using the preference option is not really fixing the
issue of slow slow scan requests, it is just a work around, parallelism
happens already by default. This is controlled via the
operation_threading option and it defaults to thread_per_shard.

Are there a lot of rejected threads in the search thread pool? This can be
seen in the node stats api (GET localhost:9200/_nodes/stats?all).

Martijn

On 16 September 2013 08:02, rockbobsta bob@figjamit.com.au wrote:

Hi,

apologies if this is a double post, I also added this question in response
to another question but I'm posting it here as I think it's actually a new
topic:

https://groups.google.com/forum/?fromgroups#!starred/elasticsearch/l-YJGJAGJLQ

I'm just wondering if the parallel scan feature mentioned in that document
is under any development?
We have an index with 20 shards and approx 125 million docs, using default
routing.
I have a situation where we run quite a lot of large scans to process
customer data, and we're looking for a way to speed up the sequential
scans. The parallel scan would be great in this situation.

Another idea was to implement this by running multiple workers (one per
shard) and enforcing that each worker only hits one shard. Each worker
would run the same scan against each shard, so we should get the same
results but faster as they run in parallel (and hopefully as each query is
only hitting one shard, this may be faster too?)
If there was a way I could determine a routing value for each worker that
would force each worker to the correct shard, I think it would be possible,
just by including routing=[some_key] with the query.
But it would mean working out some inverse of the default routing
algorithm to find a routing value that goes to each shard, and I'm not sure
how simple this is.

Does this sound like a feasible solution? Or does anyone have any
suggestions as to other ways to achieve this?
If I could get some pointers on how to implement I will have a look at
implementing something.

thanks in advance for any assistance.
Bob.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
Met vriendelijke groet,

Martijn van Groningen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Henrik Nordvik) #3

According to this page:
http://www.elasticsearch.org/guide/reference/java-api/search/

"The default mode is SINGLE_THREAD."

Henrik

On Monday, September 16, 2013 9:25:04 AM UTC+2, Martijn v Groningen wrote:

Hi,

A search request (also a scan request) is already executed in parallel. By
default each shard level request is executed in parallel.
How long is a single scan request taking? What did you specify in the
size request option?

It is possible to control to what specific shards a search request goes to
via the preference option:
http://www.elasticsearch.org/guide/reference/api/search/preference/

I think in your case using the preference option is not really fixing the
issue of slow slow scan requests, it is just a work around, parallelism
happens already by default. This is controlled via the
operation_threading option and it defaults to thread_per_shard.

Are there a lot of rejected threads in the search thread pool? This can be
seen in the node stats api (GET localhost:9200/_nodes/stats?all).

Martijn

On 16 September 2013 08:02, rockbobsta <b...@figjamit.com.au <javascript:>

wrote:

Hi,

apologies if this is a double post, I also added this question in
response to another question but I'm posting it here as I think it's
actually a new topic:

https://groups.google.com/forum/?fromgroups#!starred/elasticsearch/l-YJGJAGJLQ

I'm just wondering if the parallel scan feature mentioned in that
document is under any development?
We have an index with 20 shards and approx 125 million docs, using
default routing.
I have a situation where we run quite a lot of large scans to process
customer data, and we're looking for a way to speed up the sequential
scans. The parallel scan would be great in this situation.

Another idea was to implement this by running multiple workers (one per
shard) and enforcing that each worker only hits one shard. Each worker
would run the same scan against each shard, so we should get the same
results but faster as they run in parallel (and hopefully as each query is
only hitting one shard, this may be faster too?)
If there was a way I could determine a routing value for each worker that
would force each worker to the correct shard, I think it would be possible,
just by including routing=[some_key] with the query.
But it would mean working out some inverse of the default routing
algorithm to find a routing value that goes to each shard, and I'm not sure
how simple this is.

Does this sound like a feasible solution? Or does anyone have any
suggestions as to other ways to achieve this?
If I could get some pointers on how to implement I will have a look at
implementing something.

thanks in advance for any assistance.
Bob.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
For more options, visit https://groups.google.com/groups/opt_out.

--
Met vriendelijke groet,

Martijn van Groningen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Martijn Van Groningen) #4

This documentation is out dated, I'll update it. The default is
THREAD_PER_SHARD.

On 18 September 2013 10:16, Henrik Nordvik henrikno@gmail.com wrote:

According to this page:
http://www.elasticsearch.org/guide/reference/java-api/search/

"The default mode is SINGLE_THREAD."

Henrik

On Monday, September 16, 2013 9:25:04 AM UTC+2, Martijn v Groningen wrote:

Hi,

A search request (also a scan request) is already executed in parallel.
By default each shard level request is executed in parallel.
How long is a single scan request taking? What did you specify in the
size request option?

It is possible to control to what specific shards a search request goes
to via the preference option:
http://www.elasticsearch.org/**guide/reference/api/search/**preference/http://www.elasticsearch.org/guide/reference/api/search/preference/

I think in your case using the preference option is not really fixing the
issue of slow slow scan requests, it is just a work around, parallelism
happens already by default. This is controlled via the
operation_threading option and it defaults to thread_per_shard.

Are there a lot of rejected threads in the search thread pool? This can
be seen in the node stats api (GET localhost:9200/_nodes/stats?**all).

Martijn

On 16 September 2013 08:02, rockbobsta b...@figjamit.com.au wrote:

Hi,

apologies if this is a double post, I also added this question in
response to another question but I'm posting it here as I think it's
actually a new topic:
https://groups.google.com/forum/?fromgroups#!starred/
elasticsearch/l-YJGJAGJLQhttps://groups.google.com/forum/?fromgroups#!starred/elasticsearch/l-YJGJAGJLQ

I'm just wondering if the parallel scan feature mentioned in that
document is under any development?
We have an index with 20 shards and approx 125 million docs, using
default routing.
I have a situation where we run quite a lot of large scans to process
customer data, and we're looking for a way to speed up the sequential
scans. The parallel scan would be great in this situation.

Another idea was to implement this by running multiple workers (one per
shard) and enforcing that each worker only hits one shard. Each worker
would run the same scan against each shard, so we should get the same
results but faster as they run in parallel (and hopefully as each query is
only hitting one shard, this may be faster too?)
If there was a way I could determine a routing value for each worker
that would force each worker to the correct shard, I think it would be
possible, just by including routing=[some_key] with the query.
But it would mean working out some inverse of the default routing
algorithm to find a routing value that goes to each shard, and I'm not sure
how simple this is.

Does this sound like a feasible solution? Or does anyone have any
suggestions as to other ways to achieve this?
If I could get some pointers on how to implement I will have a look at
implementing something.

thanks in advance for any assistance.
Bob.

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@**googlegroups.com.

For more options, visit https://groups.google.com/**groups/opt_outhttps://groups.google.com/groups/opt_out
.

--
Met vriendelijke groet,

Martijn van Groningen

--
Met vriendelijke groet,

Martijn van Groningen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(system) #5