Data distribution over shards and replicas

Hi,

I've started working with Elasticsearch and have some doubts about shards
and replicas and how they handle data. I don't have any prior knowledge of
Lucene.
As far as I know, Lucene splits data into segments and stores them on disk,
and a shard is itself a Lucene index. Some of my doubts are:

  1. There are two ways to set the number of shards: at the cluster level
    with config settings, and at the index level with index settings. Suppose
    at the cluster level I set the number of shards to 3 and at the index
    level I set 5 shards; how will the shards be allocated? I have one
    cluster with one node.

  2. Suppose an index has 5 shards and 2 replicas and I push data through
    the bulk API; how will the data be stored? Will the same data be stored
    in all 5 shards, or will the data be split and stored in chunks across the
    5 shards? How will the replicas hold a copy of the data of all 5 shards?

  3. Suppose I have 5 nodes and 10 shards distributed over the nodes, 2
    shards each. When I index new documents, how will the data be stored
    across the nodes?
    Suppose the 5th node, which holds the 9th and 10th shards, suddenly goes
    down. Do I lose all the data stored in the 9th and 10th shards, or is the
    data already copied to the rest of the nodes?

Please explain.

Thanks,
Subhadip

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4ac575bd-0d0a-4f5f-972e-7f3c54f2eb85%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

1 - It will use 5, since you've set that explicitly on the index; anything
you don't specify just uses the cluster default.
2 - The data isn't copied into all the shards; the complete data set is
split across the 5 primary shards. Each replica then holds a full copy of
that data, sharded the same way, so 2 replicas means 2 extra copies of each
of the 5 shards.
3 - ES will distribute data across all shards as evenly as it can. It also
will not store a replica of a shard on the same node as its primary. So if a
node that holds a primary shard dies, then a replica shard will be
promoted to primary.
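
To make point 2 concrete, here is a minimal sketch of how a document could be
routed to one primary shard. It assumes the documented rule that the shard
number is a hash of the routing value (by default the document id) modulo the
number of primary shards. The function name pick_shard, the document ids and
the use of crc32 as the hash are illustrative assumptions, not what
Elasticsearch runs internally.

```python
import zlib
from collections import Counter

NUM_PRIMARY_SHARDS = 5  # matches the 5-shard index discussed above


def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document id) to a shard number."""
    # Stand-in hash: crc32 is used only because it is deterministic and in
    # the standard library. Elasticsearch uses its own hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# Default behaviour: each document is routed by its own id, so a bulk load is
# split over the primaries instead of being copied into each of them.
doc_ids = [f"doc-{i}" for i in range(1000)]
placement = Counter(pick_shard(doc_id, NUM_PRIMARY_SHARDS) for doc_id in doc_ids)
for shard in sorted(placement):
    print(f"shard {shard}: {placement[shard]} documents")

# Custom routing: documents indexed with the same routing value always map to
# the same shard, whatever their ids are.
same_shard = {pick_shard("customer-42", NUM_PRIMARY_SHARDS) for _ in doc_ids}
print("shards used with one routing value:", len(same_shard))
```

Every document ends up on exactly one primary shard, and the replicas of that
shard receive the same document.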

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

Thanks, Mark, for the prompt reply. I have some more doubts:

  1. Suppose one index runs with 3 shards and 1 replica and another index
    runs with the cluster settings, i.e. 5 shards and 2 replicas. Will a total
    of 3+1 or 5+2 shards be available in the cluster? I have installed the
    elasticsearch-head plugin, but the replica shards are not showing there.

    For data distribution, does a replica shard also keep documents of other
    indices, or is it used only to keep a backup copy of the data?

  2. So documents in the same index are split by sharding and distributed
    over the shards, right? Can we push all the documents of an index into
    one particular shard? I don't want to use custom routing, as I would then
    need one field value common to all the documents. How can we find out
    which shard holds which documents?

  3. If I create an index with 2 shards and no replicas and the node holding
    these 2 shards dies, will I lose the data, or will the data have a copy
    in a cluster-level replica? If I have only 1 replica and the node that
    holds the replica dies, how will the backup happen?

1 - Data from both will be available; you've just told ES not to use the
defaults for one index. A replica is not a backup, it's a 1:1 replica, so it
will contain the same data as its primary shard.
2 - Not sure, but I don't think so, as ES will try to spread documents
across the shards. Routing is the recommended method for what you want.
3 - Yes, you would lose the data, although you are unlikely to have both
shards on one node unless it is a single-node cluster. What do you mean by
backup? If you're talking about replicas instead, then the cluster will
build a new replica if one dies.
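
Points 1 and 3 can be sketched with a toy model. It assumes only what is
stated in this thread: each primary shard has as many extra copies as the
replica count, and two copies of the same shard never share a node. The
functions total_shard_copies, allocate and fail_node and the node names are
hypothetical helpers for illustration, not Elasticsearch APIs, and the
round-robin placement is a simplification of the real allocator.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary has `replicas` extra copies."""
    return primaries * (1 + replicas)


def allocate(num_shards: int, num_replicas: int, nodes: list) -> dict:
    """Place copies round-robin, at most one copy of a shard per node.

    Returns {shard: [primary_node, replica_node, ...]}. Copies that cannot be
    placed (not enough nodes) are simply left unassigned.
    """
    layout = {}
    for shard in range(num_shards):
        start = shard % len(nodes)
        rotated = nodes[start:] + nodes[:start]  # spread primaries over nodes
        layout[shard] = rotated[: 1 + num_replicas]
    return layout


def fail_node(layout: dict, dead: str):
    """Remove a node; the first surviving copy acts as the new primary."""
    survivors = {s: [n for n in copies if n != dead] for s, copies in layout.items()}
    lost = sorted(s for s, copies in survivors.items() if not copies)
    return survivors, lost


# Point 1: a 3-shard/1-replica index plus a 5-shard/2-replica index.
print(total_shard_copies(3, 1) + total_shard_copies(5, 2))  # 6 + 15 = 21

nodes = ["node1", "node2", "node3"]

# Point 3, no replicas: shard 0 lives only on node1, so it is lost with it.
_, lost = fail_node(allocate(2, 0, nodes), "node1")
print("lost without replicas:", lost)  # [0]

# Point 3, one replica: the copy on node2 takes over as primary.
after, lost = fail_node(allocate(2, 1, nodes), "node1")
print("lost with one replica:", lost, "shard 0 now on", after[0])

# Single-node cluster: only the primaries fit, the replicas stay unassigned.
print(allocate(5, 2, ["node1"]))
```

The last line may also explain why no replicas showed up in
elasticsearch-head: with a single node there is no second node to hold them.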

Regards,
Mark Walkom

Thanks a lot Mark. That explains a lot.

By backup I meant a copy of the same data.

One last question: for fast searching, which is the better choice, a single
index with multiple shards or multiple indices with a single shard each?

Can you please give some reference on how Lucene splits documents and stores
them in shards? That will help me get a better idea.

Thanks,
Subhadip

Depends on your data and cluster configuration, but you probably don't want
a single shard unless it's a tiny, tiny index.
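
As a rough illustration of why the shard count matters for searching: a
search on an index is run against one copy of every shard, and the per-shard
results are merged into one result list. The sketch below assumes a toy
term-count score and made-up documents; search_shard and search_index are
hypothetical names, and real scoring is done by Lucene, not by this code.

```python
import heapq


def search_shard(shard_docs: dict, term: str, size: int) -> list:
    """Score the documents of one shard and return its local top `size` hits."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def search_index(shards: list, term: str, size: int) -> list:
    """Query every shard, then merge the local results into a global top list."""
    # Done one after another here; a cluster can work on the shards in
    # parallel, on different nodes.
    per_shard = [search_shard(docs, term, size) for docs in shards]
    return heapq.nlargest(size, (hit for hits in per_shard for hit in hits))


# One index split over three shards (made-up documents).
shards = [
    {"a": "shard replica shard"},
    {"b": "shard"},
    {"c": "node"},
]
print(search_index(shards, "shard", 2))  # [(2, 'a'), (1, 'b')]
```

More shards means smaller pieces of work that can be spread over more nodes,
but also more partial results to merge, so the right number depends on the
data size and the cluster.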

Regards,
Mark Walkom
