As far as I have explored ES, this is what I have understood:

A replica of an index holds all of its documents, and it is stored on
that box.

A shard is a Lucene object which holds a part of the whole index. That
is, if an index has 5 shards, the first shard will hold the first 20% of
the index data, the second one will hold the 20% to 40% range, and so on.

When a search is issued, the query hits all shards, and their results
are aggregated to give the final result.

One shard should have at least 1 copy of all the documents of that
index.

My question is: if we set 2 shards and 2 replicas, will there be 3 copies
of the same index in each shard (that is, 3 copies of the original data in
total), or 3 copies of the whole index in total across the whole cluster?

For each shard, you will find one primary and two replicas in your cluster.

As you have 2 shards, you will have 50% of your docs in the first shard
(with 2 replica copies) and 50% in the second shard (with 2 replica copies).

If you have 100 docs using 10 MB, you will have 30 MB used in your cluster
(if you have enough nodes).

If you have 1 node, your cluster will be yellow and you will have shard 0
with 5 MB and shard 1 with 5 MB. No replica can be allocated, so you will
use 10 MB in your cluster.

If you have 2 nodes, your cluster will be yellow and you will have the
shard 0 primary and a shard 1 replica on the first node using 5 MB each, and
a shard 0 replica and the shard 1 primary on the second node using 5 MB
each. So you will use 20 MB in your cluster.

If you have 3 nodes, your cluster will be green and you will have something
like the shard 0 primary and a shard 1 replica on the first node using 5 MB
each, a shard 0 replica and the shard 1 primary on the second node using
5 MB each, and a shard 0 replica and a shard 1 replica on the third node
using 5 MB each. So you will use 30 MB in your cluster.
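
To make the arithmetic above easier to check, here is a minimal Python sketch. It is not anything Elasticsearch runs; `cluster_usage` and its parameters are hypothetical names for illustration. It assumes documents are spread evenly across shards and that two copies of the same shard are never placed on the same node.

```python
def cluster_usage(nodes, shards=2, replicas=2, index_size_mb=10.0):
    """Estimate disk usage (MB) and health colour for a single index.

    Assumptions: even spread of documents across shards, and at most one
    copy (primary or replica) of a given shard per node.
    """
    shard_size_mb = index_size_mb / shards
    copies_wanted = 1 + replicas                  # the primary plus its replicas
    copies_assigned = min(copies_wanted, nodes)   # one copy of a shard per node
    used_mb = shards * copies_assigned * shard_size_mb

    if copies_assigned == 0:
        health = "red"        # no primary can be allocated
    elif copies_assigned < copies_wanted:
        health = "yellow"     # primaries allocated, some replicas are not
    else:
        health = "green"      # every copy is allocated
    return used_mb, health


for n in (1, 2, 3):
    print(n, cluster_usage(n))
# 1 (10.0, 'yellow')
# 2 (20.0, 'yellow')
# 3 (30.0, 'green')
```

With 2 shards and 2 replicas there are 6 shard copies in total, which is why the cluster only turns green once 3 nodes are available.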

That helps a lot, David.
Your post is really an eye-opener.

Thanks
Vineeth

On Fri, Oct 21, 2011 at 12:44 AM, David Pilato david@pilato.fr wrote:

For each shard, you will find one primary and two replicas in your cluster.

I just saw in some posts that once the number of shards is set for an
index, it can't be changed.
If my search system takes in lots of data after years of harvesting, I
might need to increase the number of shards to improve performance.
How can I achieve that?

Also, let's say there are just 2 machines, each with an instance of
Elasticsearch. Each has a shard with 0 replicas. If that is the case and
one machine dies, will 50% of my data be lost?

Another question comes to mind. For a small set of data it is better to
use 1 shard. In such cases, can I set a condition so that ONLY IF the
document size exceeds N MB, data is rebalanced to the next shard, and again
only if it exceeds M MB (the 2 shards combined), the next shard is used,
and so on?

I just saw in some posts that once the number of shards is set for an
index, it can't be changed.
If my search system takes in lots of data after years of harvesting, I
might need to increase the number of shards to improve performance.
How can I achieve that?

Yes, you can't change the number of shards of an existing index. There are
ways around that. One is to create "more" shards at the start, something
like 20 (which, size-wise, will take you up to the capacity of 20 machines),
while starting with only 3 machines. Another is to use several indices (you
can search across them). Routing can also come into play to constrain
searches to specific shards in an index with many shards.
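
A rough sketch of why the shard count is fixed and how routing narrows a search: a document's shard is chosen by hashing its routing value (the document id by default) modulo the number of primary shards. The Python below only illustrates that idea. `shard_for` is a hypothetical helper, and CRC32 is a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib


def shard_for(routing_key: str, number_of_shards: int) -> int:
    """Pick a shard by hashing the routing key modulo the shard count.

    CRC32 is only a stand-in here; Elasticsearch uses its own hash function.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


# The same routing key always maps to the same shard, so a search that
# supplies that key only needs to visit one shard out of the 20.
print(shard_for("user-42", 20) == shard_for("user-42", 20))  # True

# Changing the shard count changes the result of the modulo for most keys,
# which is why existing documents would no longer be found where expected.
keys = [f"user-{i}" for i in range(1000)]
moved = sum(shard_for(k, 5) != shard_for(k, 6) for k in keys)
print(f"{moved} of {len(keys)} keys would map to a different shard")
```

This is also why over-allocating shards up front works: the modulo stays the same while the shards themselves are free to move onto new machines.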

Also, let's say there are just 2 machines, each with an instance of
Elasticsearch. Each has a shard with 0 replicas. If that is the case and
one machine dies, will 50% of my data be lost?

It will not be lost if you can bring that machine back with the data it
held.

Another question comes to mind. For a small set of data it is better to
use 1 shard. In such cases, can I set a condition so that ONLY IF the
document size exceeds N MB, data is rebalanced to the next shard, and again
only if it exceeds M MB (the 2 shards combined), the next shard is used,
and so on?

You can do that by simply using as many indices as you want and adding
indices later on. You can always search across more than one index.
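
One way to picture the "add indices later" approach is the small Python sketch below. It is only an application-side illustration under assumptions: the index names (`docs-1`, `docs-2`), the size threshold, and both helper functions are hypothetical, and the sizes are supplied by the caller rather than read from a cluster. The only Elasticsearch detail relied on is that a search request can name several indices separated by commas in the URL path.

```python
def pick_write_index(indices, sizes_mb, max_size_mb, prefix="docs"):
    """Return the index to write to, starting a new one when the latest is full.

    `indices` is an ordered list of index names and `sizes_mb` maps each
    name to its current size; both are tracked by the application.
    """
    latest = indices[-1]
    if sizes_mb[latest] < max_size_mb:
        return latest
    new_index = f"{prefix}-{len(indices) + 1}"
    indices.append(new_index)
    sizes_mb[new_index] = 0.0
    return new_index


def search_path(indices):
    """Build a search path that covers every index in the list."""
    return "/" + ",".join(indices) + "/_search"


indices = ["docs-1"]
sizes_mb = {"docs-1": 120.0}
print(pick_write_index(indices, sizes_mb, max_size_mb=100.0))  # docs-2
print(search_path(indices))  # /docs-1,docs-2/_search
```

The sketch only tracks names; actually creating the new index is a separate step that is left out here.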
