Is it possible to explicitly set the shard id to search on in a query

dobe · March 5, 2012, 8:56am

hi

for backup purposes we would like to scan all docs of a given shard id.

is it somehow possible to do this? along with preference=_primary this
would allow smaller granularity when exporting data in parallell.
additionally it would allow to better control data consistency when an
application is designed to have related data on the same shard.

if not, would it be feasable to implement a special _routing value e.g. an
integer and then use the integer directly instead of hashing it, even
though this would break backwards compatibility?

another option i can think of is to support a _shardid attribute on queries

thanks in advance, bernd

kimchy · March 5, 2012, 3:23pm

There isn't an option to do it, and I am very reluctant to add it (its best if its kept internal). But, it makes little sense to still have the sharding aspect of ES factored into a backup when its done with scan, no?

On Monday, March 5, 2012 at 10:56 AM, dobe wrote:

hi

for backup purposes we would like to scan all docs of a given shard id.

is it somehow possible to do this? along with preference=_primary this would allow smaller granularity when exporting data in parallell. additionally it would allow to better control data consistency when an application is designed to have related data on the same shard.

if not, would it be feasable to implement a special _routing value e.g. an integer and then use the integer directly instead of hashing it, even though this would break backwards compatibility?

another option i can think of is to support a _shardid attribute on queries

thanks in advance, bernd

dobe · March 5, 2012, 7:23pm

the point where it makes sense is to have one backup entity per shard, when
backing up all shards at once via parallel requests e.g. into hdfs. if the
application uses routing it would be easy to keep related data in the same
data chunk, which will lead to better consistency of related data (e.g
deleted comments of a blog post). i know that this could also be done
without a scan by just rsyncing the shards, but rsyncing does not allow to
restore data into another index (e.g for changing number of shards) or
different versions of elasticsearch.

On Monday, March 5, 2012 4:23:29 PM UTC+1, kimchy wrote:

There isn't an option to do it, and I am very reluctant to add it (its
best if its kept internal). But, it makes little sense to still have the
sharding aspect of ES factored into a backup when its done with scan, no?

On Monday, March 5, 2012 at 10:56 AM, dobe wrote:

hi

for backup purposes we would like to scan all docs of a given shard id.

is it somehow possible to do this? along with preference=_primary this
would allow smaller granularity when exporting data in parallell.
additionally it would allow to better control data consistency when an
application is designed to have related data on the same shard.

if not, would it be feasable to implement a special _routing value e.g. an
integer and then use the integer directly instead of hashing it, even
though this would break backwards compatibility?

another option i can think of is to support a _shardid attribute on queries

thanks in advance, bernd

kimchy · March 6, 2012, 7:46am

So what you are looking for is parallelism? Data integrity is guaranteed, regardless if you scan just one shard, or several, at the same time (its a point in time snapshot scan of when you started to execute the scan). I think that a feature I had in mind will come in handy in this case, which is parallel scan allowing to start multiple parallel scans across the data (i.e. return several scroll ids, each for a subset of the data).

On Monday, March 5, 2012 at 9:23 PM, dobe wrote:

the point where it makes sense is to have one backup entity per shard, when backing up all shards at once via parallel requests e.g. into hdfs. if the application uses routing it would be easy to keep related data in the same data chunk, which will lead to better consistency of related data (e.g deleted comments of a blog post). i know that this could also be done without a scan by just rsyncing the shards, but rsyncing does not allow to restore data into another index (e.g for changing number of shards) or different versions of elasticsearch.

On Monday, March 5, 2012 4:23:29 PM UTC+1, kimchy wrote:

There isn't an option to do it, and I am very reluctant to add it (its best if its kept internal). But, it makes little sense to still have the sharding aspect of ES factored into a backup when its done with scan, no?

On Monday, March 5, 2012 at 10:56 AM, dobe wrote:

hi

for backup purposes we would like to scan all docs of a given shard id.

is it somehow possible to do this? along with preference=_primary this would allow smaller granularity when exporting data in parallell. additionally it would allow to better control data consistency when an application is designed to have related data on the same shard.

if not, would it be feasable to implement a special _routing value e.g. an integer and then use the integer directly instead of hashing it, even though this would break backwards compatibility?

another option i can think of is to support a _shardid attribute on queries

thanks in advance, bernd

rockbobsta · August 27, 2013, 1:16am

Hi,

I'm just wondering if the parallel scan feature is under any development?
We have an index with 20 shards and approx 125 million docs, using default
routing.
I have a situation where we run quite a lot of large scans to process
customer data, and we're looking for a way to speed up the sequential
scans. The parallel scan would be great in this situation.

Another idea was to implement this by running multiple workers (one per
shard) and enforcing that each worker only hits one shard. Each worker
would run the same scan against each shard, so we should get the same
results but faster as they run in parallel (and hopefully as each query is
only hitting one shard, this may be faster too?)
If there was a way I could determine a routing value for each worker that
would force each worker to the correct shard, I think it would be possible,
just by including routing=[some_key] with the query.
But it would mean working out some inverse of the default routing algorithm
to find a routing value that goes to each shard, and I'm not sure how
simple this is.

Does this sound like a feasible solution? Or does anyone have any
suggestions as to other ways to achieve this?
If I could get some pointers on how to implement I will have a look at
implementing something.

thanks in advance for any assistance.
Bob.

On Tuesday, March 6, 2012 6:46:35 PM UTC+11, kimchy wrote:

So what you are looking for is parallelism? Data integrity is guaranteed,
regardless if you scan just one shard, or several, at the same time (its a
point in time snapshot scan of when you started to execute the scan). I
think that a feature I had in mind will come in handy in this case, which
is parallel scan allowing to start multiple parallel scans across the data
(i.e. return several scroll ids, each for a subset of the data).

On Monday, March 5, 2012 at 9:23 PM, dobe wrote:

the point where it makes sense is to have one backup entity per shard,
when backing up all shards at once via parallel requests e.g. into hdfs. if
the application uses routing it would be easy to keep related data in the
same data chunk, which will lead to better consistency of related data (e.g
deleted comments of a blog post). i know that this could also be done
without a scan by just rsyncing the shards, but rsyncing does not allow to
restore data into another index (e.g for changing number of shards) or
different versions of elasticsearch.

On Monday, March 5, 2012 4:23:29 PM UTC+1, kimchy wrote:

There isn't an option to do it, and I am very reluctant to add it (its
best if its kept internal). But, it makes little sense to still have the
sharding aspect of ES factored into a backup when its done with scan, no?

On Monday, March 5, 2012 at 10:56 AM, dobe wrote:

hi

for backup purposes we would like to scan all docs of a given shard id.

is it somehow possible to do this? along with preference=_primary this
would allow smaller granularity when exporting data in parallell.
additionally it would allow to better control data consistency when an
application is designed to have related data on the same shard.

if not, would it be feasable to implement a special _routing value e.g. an
integer and then use the integer directly instead of hashing it, even
though this would break backwards compatibility?

another option i can think of is to support a _shardid attribute on queries

thanks in advance, bernd

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Topic		Replies	Views
Parallel scans to speed up scrolling Elasticsearch	4	2276	July 6, 2017
Docs about sharding and scatter/gather Elasticsearch	5	1946	July 6, 2017
Is there an easy way to get the shard of a document? Elasticsearch	17	1585	August 22, 2022
Route documents at index time to a particular shard Elasticsearch	2	383	July 6, 2017
Determine Shard Id based on routing key Elasticsearch	5	1207	July 6, 2017

Is it possible to explicitly set the shard id to search on in a query

Related topics