Hi,
I've read the online guides and searched for past threads. I'm
trying to get a clear definition of the following terms but having
trouble.
Node - I think this means an instance of ES running on a server.
Index - I know what this means in Lucene, but is it identical in ES
or is there more? e.g. logical definition? or physical?
Shard - I see that indices are created with 'shards'. But what IS a
shard and why does it exist? When I have an index with 5 shards, does
that mean 5 physical lucene indices that are separate?
Replica - I read that a shard has a replica. How does that work? If
I have 5 shards, I need 5 replicas for each shard as failover data? Or
1 replica merges the data of 5 shards into itself?
Maybe there is a simple diagram somewhere showing these relationships.
Reading the API docs to learn these relationships is rather hard. Any
help appreciated.
On Wednesday, January 25, 2012 at 3:35 PM, project2501 wrote:
Hi,
I've read the online guides and searched for past threads. I'm
trying to get a clear definition of the following terms but having
trouble.
Node - I think this means an instance of ES running on a server.
Yes, an instance of elasticsearch.
Index - I know what this means in Lucene, but is it identical in ES
or is there more? e.g. logical definition? or physical?
An index is a logical concept that encapsulates specific data. It is broken down into shards, and shards can have replicas. It also holds the mapping definitions and the specific settings associated with it.
Shard - I see that indices are created with 'shards'. But what IS a
shard and why does it exist? When I have an index with 5 shards, does
that mean 5 physical lucene indices that are separate?
Yes, and more if you have replicas, since each replica is a separate Lucene index as well.
Replica - I read that a shard has a replica. How does that work? If
I have 5 shards, I need 5 replicas for each shard as failover data? Or
1 replica merges the data of 5 shards into itself?
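If you have an index with 5 shards and index.number_of_replicas set to 1, then each shard will have a single replica, so there are two copies of each shard (10 shard copies in total). A replica never merges the data of several shards into itself.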
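To make the arithmetic concrete, here is a minimal Python sketch. The helper name total_shard_copies is made up for illustration and is not part of any elasticsearch API; number_of_shards and number_of_replicas are the index settings mentioned above, shown in the shape they would take in a create-index request body.

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Count every shard copy: each primary plus its replicas.

    Hypothetical helper for illustration only. Each copy, primary or
    replica, is its own separate Lucene index.
    """
    return number_of_shards * (1 + number_of_replicas)


# Settings for an index with 5 shards and 1 replica per shard.
settings = {"settings": {"index": {"number_of_shards": 5, "number_of_replicas": 1}}}

index_settings = settings["settings"]["index"]
copies = total_shard_copies(
    index_settings["number_of_shards"],
    index_settings["number_of_replicas"],
)
print(copies)  # 10
```

With number_of_replicas set to 0 the same index would have only its 5 primary shards and no failover copies.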
Maybe there is a simple diagram somewhere showing these relationships.
Reading the API docs to learn these relationships is rather hard. Any
help appreciated.
So a shard is a 'physical' index, in the lucene sense? And an ES index
is broken down into multiple physical indexes for some reason?
Performance? scaling?
If an index is broken down into 5 shards, each shard can be allocated on a
different physical machine (and thus each shard can grow up to the capacity
of its machine). If you summed up all 5 shards, the combined index could be
larger than any one of those five machines can hold. So yes, this helps
scalability as well as performance, because all shards can be processed
(for example, searched) concurrently.
Just note that elasticsearch has had the distributed notion in its DNA since
the very beginning, so everything in it is about distributed, concurrent and
possibly [near] real-time processing (including index and search
operations). That is why you need index sharding and many other concepts
found in it.
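As a rough illustration of the idea (not how elasticsearch is actually implemented), the toy Python sketch below spreads documents over 5 in-memory "shards" and searches them concurrently. Everything in it is an assumption made for the example: the CRC32-based routing, the dicts standing in for Lucene indices, and the function names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 5

# Each "shard" is a plain dict standing in for a separate Lucene index.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    # Toy routing (assumption): stable hash of the id, modulo the shard count.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_doc(doc_id: str, text: str) -> None:
    # A document lives in exactly one shard.
    shards[shard_for(doc_id)][doc_id] = text


def search_shard(shard: dict, term: str) -> list:
    # Naive substring match over a single shard only.
    return [doc_id for doc_id, text in shard.items() if term in text]


def search(term: str) -> list:
    # Scatter the query to all shards concurrently, then gather and merge.
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partial = list(pool.map(lambda s: search_shard(s, term), shards))
    return sorted(doc_id for hits in partial for doc_id in hits)


for i in range(20):
    index_doc(f"doc-{i}", "even" if i % 2 == 0 else "odd")

print(search("even"))
```

In a real cluster the shards would sit on different machines rather than in one process, but the shape is the same: no single shard holds the whole index, and a search fans out to all of them and merges the partial results.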
So if I have a cluster of 100 machines and I want to have an index for
'documents', would I have 1 'documents' index and 100 shards? Then the
elasticsearch nodes discover each other and decide how to distribute the
load among them?