I've started working with Elasticsearch and have some questions about shards
and replicas and how they handle data. I don't have any prior knowledge of
Lucene.
As far as I know, Lucene splits data into segments and stores them on disk,
and a shard is itself a Lucene index. Some of my questions are...
There are two ways to set the number of shards: at the cluster level with
config settings, and at the index level with index settings. Suppose I set
the shard count to 3 at the cluster level and to 5 at the index level;
how will the shards be allocated? I have one cluster with one node.
Suppose an index has 5 shards and 2 replicas and I'm pushing data through
the bulk API; how will the data be stored? Will the same data be stored
in all 5 shards, or will the data be split and stored in chunks across the
5 shards? How will the replicas hold a backup of the data of all 5 shards?
Suppose I have 5 nodes and 10 shards distributed over the nodes, 2
shards each. When I index new documents, how will the data be stored
across the nodes?
Suppose the 5th node, which holds the 9th and 10th shards, suddenly goes
down. Do I lose all the data stored in the 9th and 10th shards, or has the
data already been copied to the rest of the nodes?
1 - It will use 5, as you've specifically set that. Anything you don't
specify will just use the cluster default.
2 - The data isn't replicated into all the shards; the complete data set is
split across the 5 shards. Each replica set then contains a copy of that
data, which is sharded in the same way (with 2 replicas, that is 5 primary
shards plus 10 replica shards).
3 - ES will distribute data across all shards as best it can. It will also
not store a replica of a shard on the same node as its primary. So if a
node that holds a primary shard dies, a replica (secondary) shard on
another node will be promoted to primary.
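As a rough illustration of point 2, here is a minimal Python sketch of how a document ends up in exactly one primary shard. It assumes the documented general rule of hashing the routing value (the document ID by default) and taking it modulo the number of primary shards. The names `pick_shard` and `distribute` are made up for this example, and `zlib.crc32` is only a stand-in for the hash Elasticsearch really uses, so the shard numbers printed here will not match a real cluster.

```python
import zlib


def pick_shard(routing, num_primary_shards):
    """Map a routing value (by default the document ID) to one primary shard.

    Assumption: crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the primary shard each one is routed to."""
    shards = {number: [] for number in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


# 20 hypothetical documents sent through the bulk API to a 5-shard index.
doc_ids = [f"doc-{i}" for i in range(20)]
shards = distribute(doc_ids, 5)
for number, ids in shards.items():
    print(number, ids)
```

Each document lands in one shard only, and the same ID always maps to the same shard, which is also how a later get by ID can find it. Replicas then hold copies of whole shards rather than of individual documents.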
Thanks Mark for the prompt reply. I have some more questions.
Suppose one index is running with 3 shards and 1 replica and another index
is running with the cluster settings, i.e. 5 shards and 2 replicas; will a
total of 3+1 or 5+2 shards be available in the cluster? I have installed the
elasticsearch-head plugin, but the replica shards are not showing there.
For data distribution, does a replica shard also hold documents from other
indices, or is it used only to keep a backup copy of the data?
So documents in the same index will be split by sharding and distributed
over the shards, right? Can we push all the documents for the same
index into a particular shard? I don't want to use custom routing, as then I
would need one field value common to all the documents. How can we find out
which shard is holding which documents?
If I create an index with 2 shards and no replicas, and the node in the
cluster holding these 2 shards dies, will I lose the data, or will the data
have a copy in a cluster-level replica? If I have only 1 replica and the
node that holds the replica dies, how will the backup happen?
1 - Data from both will be available; you've just told ES not to use the
defaults for one index. A replica is not a backup; it's a 1:1 copy, so it
will contain the same data as its primary shard.
2 - Not sure, but I don't think so, as ES will try to spread documents
across the shards. Routing is the recommended method for what you want.
3 - Yes, although you are unlikely to have them both on one node unless it
is a single-node cluster. What do you mean by backup? If you're talking
about replicas, then the cluster will build a new replica if one dies.
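To put numbers on point 1, here is a small Python sketch of the shard arithmetic for the two indices from the question. The function names are made up for this example. The second function assumes only the rule mentioned earlier in the thread, that a replica is never placed on the same node as another copy of the same shard.

```python
def total_shards(primaries, replicas):
    """Every primary shard has `replicas` additional copies."""
    return primaries * (1 + replicas)


def assignable_shards(primaries, replicas, nodes):
    """Copies that can actually be placed when no node may hold two
    copies of the same shard (simplified; ignores other allocation rules)."""
    copies_per_shard = min(1 + replicas, nodes)
    return primaries * copies_per_shard


# Index A: 3 shards, 1 replica. Index B: 5 shards, 2 replicas.
index_a = total_shards(3, 1)
index_b = total_shards(5, 2)
print("configured:", index_a, "+", index_b, "=", index_a + index_b)

# On a single-node cluster only the primaries can be assigned.
placed = assignable_shards(3, 1, nodes=1) + assignable_shards(5, 2, nodes=1)
print("assignable on one node:", placed)
print("unassigned replicas:", index_a + index_b - placed)
```

So the totals are 3 x 2 and 5 x 3 rather than 3+1 or 5+2, and on a one-node cluster the replica shards have nowhere to go, which is probably why they do not show up as allocated.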