Shards and replicas

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Heya,

In elasticsearch, when you create an index, you define the number of shards and number of replicas. By default, an index is created with 5 shards and 1 replica per shard. If you have 2 nodes, and create a single index, with 5 shards and 1 replica, then each node will have 5 shards. Once you start adding more nodes, those shards will start to get rebalanced between those nodes.

Thats the gist of it in a very high level manner. One way you can understand it possibly better is to simply use the cluster state API, it gives a nice breakdown of indices, shards, and where they are allocated and on which nodes. You can easily create indices and start several servers on your laptop and see how it behaves.

Back to your question. This does not sound like a large set of documents. If you did not play with changing the number of shards and replica, you have nice growth path of up to 10 nodes (1 shard per node), size wise you can grow up to 5 shards (not counting replicas) which is up to 5 servers/nodes.

-shay.banon
On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Thanks for the reply.




However I still do not understand how I should configure
shards/replicas in respect to different scenarios. For instance if I
use a single node and end up with 5 shards in it. Would it not be
better to use a setup of 1 shard / 1 replica? All is on the same
node anyways?


Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit
in that scenario. And how does replicas play in? In what way does
several replicas affect the situation. Are they for redundance? I.e
if a node goes down a *replica* of the shard on another node still
serves the index? If that is correct, then what is a good ratio
between shards/replicas for a set number of servers?


Many questions... :)




Back to the index I am building. Yes, it is a small index in the
Lucene/ES world. But I expect to add several of the same magnitude
to the mix as time goes.


And I have another application (for library catalogues) which will
index up to a million documents per index, but there each document
is rather small.




/Kristian




Shay Banon skrev 2011-04-15 00:38:
<blockquote cite="mid:8409BC350CEB412E8639FF8BF6D9E92A@elasticsearch.com" type="cite">Heya,

  In elasticsearch, when you create an index, you
define the number of shards and number of replicas. By
default, an index is created with 5 shards and 1 replica per
shard. If you have 2 nodes, and create a single index, with
5 shards and 1 replica, then each node will have 5 shards.
Once you start adding more nodes, those shards will start to
get rebalanced between those nodes.

  Thats the gist of it in a very high level manner.
One way you can understand it possibly better is to simply
use the cluster state API, it gives a nice breakdown of
indices, shards, and where they are allocated and on which
nodes. You can easily create indices and start several
servers on your laptop and see how it behaves.

  Back to your question. This does not sound like a
large set of documents. If you did not play with changing
the number of shards and replica, you have nice growth path
of up to 10 nodes (1 shard per node), size wise you can grow
up to 5 shards (not counting replicas) which is up to 5
servers/nodes.

-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
              goes.


              The first time ever I came by the names shards and
              replicas was when I 


              started looking into new search engines for my web
              applications and 


              ended up here with ES. I understand these terms
              describe how the index 


              is divided between instances of nodes in a cluster,
              but I would really 


              need a deep understanding of the concepts. Especially
              as we are now 


              beginning to put ES in production and will actually
              start using two 


              nodes as a cluster.




              So any good pointers to info on these concepts would
              be much 


              appreciated. I tried googling but there was too much
              noise and *nothing* 


              to actually explain the concepts. First time I came up
              empty with google...




              I am about to put an index of about 50.000 documents
              (each document an 


              OCR interpreted page of a book) on a cluster of two
              servers. What would 


              be a good setting for shards and replicas for this
              type of index and 


              cluster?
-- 
Med vänlig hälsning
Kristian Jörg

Devo IT AB
Tel: 054 - 22 14 58, 0709 - 15 83 42
E-post: kristian.jorg@devo.se
Webb: http://www.devo.se

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.

As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg krjg@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas in
respect to different scenarios. For instance if I use a single node and end
up with 5 shards in it. Would it not be better to use a setup of 1 shard / 1
replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on several
nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index up
to a million documents per index, but there each document is rather small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,

In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.

Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.

Back to your question. This does not sound like a large set of

documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.

-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

--
Med vänlig hälsning
Kristian Jörg

Devo IT AB
Tel: 054 - 22 14 58, 0709 - 15 83 42
E-post: kristian.jorg@devo.se
Webb: http://www.devo.se

Hi,




thanks for all information everybody! 


I am now getting the grasp on this and feel confident I can manage
the configuration from here on.


  


/Kristian




Berkay Mollamustafaoglu skrev 2011-04-15 14:24:
<blockquote cite="mid:BANLkTikrxkHckT1j6WcauLaemVRpciOqhg@mail.gmail.com" type="cite">Shard/replica configuration depends on your needs. There are
    many factors, how many docs, users, number of queries, etc.  50K
    documents is not much so you're probably OK with one server
    unless you have very high number of users/queries. Replica is for redundancy but also for improves query
    performance. If you need high availability, you should have 2
    servers naturally. In this case, 5 shards and 1 replica (note 1
    replica means 1 additional copy, total of 2 copies) would be
    reasonable. If you will have only 1 server, then 5 shards and no replica
    is better, you can always add a replica later, if needed. Adding
    shards later is not that easy so even with 1 server, I'd keep
    number of shards higher. If your memory is really limited and
    you don't expect expansion then you can limit the number of
    shards to 2.

As you can see, there are a lot of if, then, unless in the
statements above. I'd recommend playing around with it, see how
the performance is and adjust accordingly, rather than trying to
figure out everything from the beginning. It would not be
difficult to change your config and re-index 50K docs

  Regards,


  Berkay Mollamustafaoglu


  mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian
Jörg <krjg@devo.se>
wrote:

Thanks for the reply.
        However I still do not understand how I should configure
        shards/replicas in respect to different scenarios. For
        instance if I use a single node and end up with 5 shards in
        it. Would it not be better to use a setup of 1 shard / 1
        replica? All is on the same node anyways?


        Is it better to split the index in many shards and balance
        those on several nodes? I guess network protocols will play
        a role as a limit in that scenario. And how does replicas
        play in? In what way does several replicas affect the
        situation. Are they for redundance? I.e if a node goes down
        a *replica* of the shard on another node still serves the
        index? If that is correct, then what is a good ratio between
        shards/replicas for a set number of servers?


        Many questions... :)




        Back to the index I am building. Yes, it is a small index in
        the Lucene/ES world. But I expect to add several of the same
        magnitude to the mix as time goes.


        And I have another application (for library catalogues)
        which will index up to a million documents per index, but
        there each document is rather small.




        /Kristian




        Shay Banon skrev 2011-04-15 00:38:
        <blockquote type="cite">Heya,

In elasticsearch, when you create an
index, you define the number of shards and
number of replicas. By default, an index is
created with 5 shards and 1 replica per shard.
If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will
have 5 shards. Once you start adding more nodes,
those shards will start to get rebalanced
between those nodes.

Thats the gist of it in a very high
level manner. One way you can understand it
possibly better is to simply use the cluster
state API, it gives a nice breakdown of indices,
shards, and where they are allocated and on
which nodes. You can easily create indices and
start several servers on your laptop and see how
it behaves.

Back to your question. This does not
sound like a large set of documents. If you did
not play with changing the number of shards and
replica, you have nice growth path of up to 10
nodes (1 shard per node), size wise you can grow
up to 5 shards (not counting replicas) which is
up to 5 servers/nodes.

-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
                          goes.


                          The first time ever I came by the names
                          shards and replicas was when I 


                          started looking into new search engines
                          for my web applications and 


                          ended up here with ES. I understand these
                          terms describe how the index 


                          is divided between instances of nodes in a
                          cluster, but I would really 


                          need a deep understanding of the concepts.
                          Especially as we are now 


                          beginning to put ES in production and will
                          actually start using two 


                          nodes as a cluster.




                          So any good pointers to info on these
                          concepts would be much 


                          appreciated. I tried googling but there
                          was too much noise and *nothing* 


                          to actually explain the concepts. First
                          time I came up empty with google...




                          I am about to put an index of about 50.000
                          documents (each document an 


                          OCR interpreted page of a book) on a
                          cluster of two servers. What would 


                          be a good setting for shards and replicas
                          for this type of index and 


                          cluster?

Hi,
is it possible to change the number of replicas dynamically? Lets say
I index my data into different indices. After every day I switch to a
new index.
If I know from my usage scenario, that the most recent data is the
data my users search for mostly and the older data is needed less
often, would it be a good idea to change
the replica settings on the fly, for instance start with 2 replicas
during the first week, then down to 1 replica for the second week, and
to no replica at all for data older than a month.

There are some open questions where I didn't find an answer for:

  • If I have no replica and I lose a node, what will happen to the
    cluster? Will queriing still work (without the lost data)?
  • If I start with 2 replicas and I go down to 1, will the data still
    stay on the hdd on the third node, so that I can recover it in a case
    off an HDD failure?
  • In order to speed up queriing, what are the prerequesites to use an
    im memory index? Is it possible to keep a backup of the data on HDD
    and use an inmemory index without replica?

Regards,
Michel

On Tue, Apr 19, 2011 at 11:14 AM, Kristian Jörg krjg@devo.se wrote:

Hi,

thanks for all information everybody!
I am now getting the grasp on this and feel confident I can manage the
configuration from here on.

/Kristian

Berkay Mollamustafaoglu skrev 2011-04-15 14:24:

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.
As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg krjg@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas
in respect to different scenarios. For instance if I use a single node and
end up with 5 shards in it. Would it not be better to use a setup of 1 shard
/ 1 replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index
up to a million documents per index, but there each document is rather
small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,
In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.
Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.
Back to your question. This does not sound like a large set of
documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.
-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with
google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Yes, you can change the number of replicas dynamically, check teh update settings API. More answers below:
On Tuesday, April 19, 2011 at 2:11 PM, Michel Conrad wrote:

Hi,
is it possible to change the number of replicas dynamically? Lets say
I index my data into different indices. After every day I switch to a
new index.
If I know from my usage scenario, that the most recent data is the
data my users search for mostly and the older data is needed less
often, would it be a good idea to change
the replica settings on the fly, for instance start with 2 replicas
during the first week, then down to 1 replica for the second week, and
to no replica at all for data older than a month.
Note, if you have no replicas, and you loose a node that had a shard for the index with no replicas, you will loose that data.

There are some open questions where I didn't find an answer for:

  • If I have no replica and I lose a node, what will happen to the
    cluster? Will queriing still work (without the lost data)?
    Querying will still work on whatever data is available, but, that relevant data is lost.
  • If I start with 2 replicas and I go down to 1, will the data still
    stay on the hdd on the third node, so that I can recover it in a case
    off an HDD failure?
    If you reduce to less replicas, then it won't exist on disk. You will now have 2 copies of the data instead of 3.
  • In order to speed up queriing, what are the prerequesites to use an
    im memory index? Is it possible to keep a backup of the data on HDD
    and use an inmemory index without replica?
    I suggest first to check the file based index, with file system cache, it will be pretty fast.

Regards,
Michel

On Tue, Apr 19, 2011 at 11:14 AM, Kristian Jörg krjg@devo.se wrote:

Hi,

thanks for all information everybody!
I am now getting the grasp on this and feel confident I can manage the
configuration from here on.

/Kristian

Berkay Mollamustafaoglu skrev 2011-04-15 14:24:

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.
As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg krjg@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas
in respect to different scenarios. For instance if I use a single node and
end up with 5 shards in it. Would it not be better to use a setup of 1 shard
/ 1 replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index
up to a million documents per index, but there each document is rather
small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,
In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.
Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.
Back to your question. This does not sound like a large set of
documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.
-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with
google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Hi Shay,

thank you for your answers. The only thing that is not clear to me is
in case that I have an index
without replicas and I lose a shard, the state will go from green to
red, as yellow cannot happen
when there are no replicas, correct? Is it possible in that case to
tell the index that there will be
one shard less from now one, that the state will be green again. I am
asking because at one point
I am waiting for the cluster state to become yellow, and with an index
on red this will not be possible.

Best,
Michel

On Tue, Apr 19, 2011 at 1:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Yes, you can change the number of replicas dynamically, check teh update
settings API. More answers below:

On Tuesday, April 19, 2011 at 2:11 PM, Michel Conrad wrote:

Hi,
is it possible to change the number of replicas dynamically? Lets say
I index my data into different indices. After every day I switch to a
new index.
If I know from my usage scenario, that the most recent data is the
data my users search for mostly and the older data is needed less
often, would it be a good idea to change
the replica settings on the fly, for instance start with 2 replicas
during the first week, then down to 1 replica for the second week, and
to no replica at all for data older than a month.

Note, if you have no replicas, and you loose a node that had a shard for the
index with no replicas, you will loose that data.

There are some open questions where I didn't find an answer for:

  • If I have no replica and I lose a node, what will happen to the
    cluster? Will queriing still work (without the lost data)?

Querying will still work on whatever data is available, but, that relevant
data is lost.

  • If I start with 2 replicas and I go down to 1, will the data still
    stay on the hdd on the third node, so that I can recover it in a case
    off an HDD failure?

If you reduce to less replicas, then it won't exist on disk. You will now
have 2 copies of the data instead of 3.

  • In order to speed up queriing, what are the prerequesites to use an
    im memory index? Is it possible to keep a backup of the data on HDD
    and use an inmemory index without replica?

I suggest first to check the file based index, with file system cache, it
will be pretty fast.

Regards,
Michel

On Tue, Apr 19, 2011 at 11:14 AM, Kristian Jörg krjg@devo.se wrote:

Hi,

thanks for all information everybody!
I am now getting the grasp on this and feel confident I can manage the
configuration from here on.

/Kristian

Berkay Mollamustafaoglu skrev 2011-04-15 14:24:

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.
As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg krjg@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas
in respect to different scenarios. For instance if I use a single node and
end up with 5 shards in it. Would it not be better to use a setup of 1 shard
/ 1 replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index
up to a million documents per index, but there each document is rather
small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,
In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.
Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.
Back to your question. This does not sound like a large set of
documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.
-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with
google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Hi,

No, that index will remain in status red, and there is no way to tell the cluster that the index will no longer have this shard available. If you do manage to bring back the machine where that shard is stored, then it will be recovered.

-shay.banon
On Tuesday, April 19, 2011 at 3:34 PM, Michel Conrad wrote:

Hi Shay,

thank you for your answers. The only thing that is not clear to me is
in case that I have an index
without replicas and I lose a shard, the state will go from green to
red, as yellow cannot happen
when there are no replicas, correct? Is it possible in that case to
tell the index that there will be
one shard less from now one, that the state will be green again. I am
asking because at one point
I am waiting for the cluster state to become yellow, and with an index
on red this will not be possible.

Best,
Michel

On Tue, Apr 19, 2011 at 1:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Yes, you can change the number of replicas dynamically, check teh update
settings API. More answers below:

On Tuesday, April 19, 2011 at 2:11 PM, Michel Conrad wrote:

Hi,
is it possible to change the number of replicas dynamically? Lets say
I index my data into different indices. After every day I switch to a
new index.
If I know from my usage scenario, that the most recent data is the
data my users search for mostly and the older data is needed less
often, would it be a good idea to change
the replica settings on the fly, for instance start with 2 replicas
during the first week, then down to 1 replica for the second week, and
to no replica at all for data older than a month.

Note, if you have no replicas, and you loose a node that had a shard for the
index with no replicas, you will loose that data.

There are some open questions where I didn't find an answer for:

  • If I have no replica and I lose a node, what will happen to the
    cluster? Will queriing still work (without the lost data)?

Querying will still work on whatever data is available, but, that relevant
data is lost.

  • If I start with 2 replicas and I go down to 1, will the data still
    stay on the hdd on the third node, so that I can recover it in a case
    off an HDD failure?

If you reduce to less replicas, then it won't exist on disk. You will now
have 2 copies of the data instead of 3.

  • In order to speed up queriing, what are the prerequesites to use an
    im memory index? Is it possible to keep a backup of the data on HDD
    and use an inmemory index without replica?

I suggest first to check the file based index, with file system cache, it
will be pretty fast.

Regards,
Michel

On Tue, Apr 19, 2011 at 11:14 AM, Kristian Jörg krjg@devo.se wrote:

Hi,

thanks for all information everybody!
I am now getting the grasp on this and feel confident I can manage the
configuration from here on.

/Kristian

Berkay Mollamustafaoglu skrev 2011-04-15 14:24:

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.
As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg krjg@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas
in respect to different scenarios. For instance if I use a single node and
end up with 5 shards in it. Would it not be better to use a setup of 1 shard
/ 1 replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index
up to a million documents per index, but there each document is rather
small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,
In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.
Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.
Back to your question. This does not sound like a large set of
documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.
-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with
google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Thanks for the clarification.

On Tue, Apr 19, 2011 at 2:36 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Hi,
No, that index will remain in status red, and there is no way to tell the
cluster that the index will no longer have this shard available. If you do
manage to bring back the machine where that shard is stored, then it will be
recovered.
-shay.banon

On Tuesday, April 19, 2011 at 3:34 PM, Michel Conrad wrote:

Hi Shay,

thank you for your answers. The only thing that is not clear to me is
in case that I have an index
without replicas and I lose a shard, the state will go from green to
red, as yellow cannot happen
when there are no replicas, correct? Is it possible in that case to
tell the index that there will be
one shard less from now one, that the state will be green again. I am
asking because at one point
I am waiting for the cluster state to become yellow, and with an index
on red this will not be possible.

Best,
Michel

On Tue, Apr 19, 2011 at 1:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Yes, you can change the number of replicas dynamically, check teh update
settings API. More answers below:

On Tuesday, April 19, 2011 at 2:11 PM, Michel Conrad wrote:

Hi,
is it possible to change the number of replicas dynamically? Lets say
I index my data into different indices. After every day I switch to a
new index.
If I know from my usage scenario, that the most recent data is the
data my users search for mostly and the older data is needed less
often, would it be a good idea to change
the replica settings on the fly, for instance start with 2 replicas
during the first week, then down to 1 replica for the second week, and
to no replica at all for data older than a month.

Note, if you have no replicas, and you loose a node that had a shard for the
index with no replicas, you will loose that data.

There are some open questions where I didn't find an answer for:

  • If I have no replica and I lose a node, what will happen to the
    cluster? Will queriing still work (without the lost data)?

Querying will still work on whatever data is available, but, that relevant
data is lost.

  • If I start with 2 replicas and I go down to 1, will the data still
    stay on the hdd on the third node, so that I can recover it in a case
    off an HDD failure?

If you reduce to less replicas, then it won't exist on disk. You will now
have 2 copies of the data instead of 3.

  • In order to speed up queriing, what are the prerequesites to use an
    im memory index? Is it possible to keep a backup of the data on HDD
    and use an inmemory index without replica?

I suggest first to check the file based index, with file system cache, it
will be pretty fast.

Regards,
Michel

On Tue, Apr 19, 2011 at 11:14 AM, Kristian Jörg krjg@devo.se wrote:

Hi,

thanks for all information everybody!
I am now getting the grasp on this and feel confident I can manage the
configuration from here on.

/Kristian

Berkay Mollamustafaoglu skrev 2011-04-15 14:24:

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.
As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg krjg@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas
in respect to different scenarios. For instance if I use a single node and
end up with 5 shards in it. Would it not be better to use a setup of 1 shard
/ 1 replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index
up to a million documents per index, but there each document is rather
small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,
In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.
Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.
Back to your question. This does not sound like a large set of
documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.
-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with
google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Good discussion. I'm lacking one baseline concept to get a handle on
this myself. Nowhere have I seen "shard" defined. In general terms I
understand "shard" to be a subdivision of data. For example, an RDBMS
table of users could be sharded across servers based on alphabetical
subdivisions of username. I assume a shard in ES is something like
that, but it would really help me to know specifically what a shard is
to ES and how it decides what goes into which shard?

On Apr 19, 7:38 am, Michel Conrad michel.con...@trendiction.com
wrote:

Thanks for the clarification.

On Tue, Apr 19, 2011 at 2:36 PM, Shay Banon

shay.ba...@elasticsearch.com wrote:

Hi,
No, that index will remain in status red, and there is no way to tell the
cluster that the index will no longer have this shard available. If you do
manage to bring back the machine where that shard is stored, then it will be
recovered.
-shay.banon

On Tuesday, April 19, 2011 at 3:34 PM, Michel Conrad wrote:

Hi Shay,

thank you for your answers. The only thing that is not clear to me is
in case that I have an index
without replicas and I lose a shard, the state will go from green to
red, as yellow cannot happen
when there are no replicas, correct? Is it possible in that case to
tell the index that there will be
one shard less from now one, that the state will be green again. I am
asking because at one point
I am waiting for the cluster state to become yellow, and with an index
on red this will not be possible.

Best,
Michel

On Tue, Apr 19, 2011 at 1:15 PM, Shay Banon
shay.ba...@elasticsearch.com wrote:

Yes, you can change the number of replicas dynamically, check teh update
settings API. More answers below:

On Tuesday, April 19, 2011 at 2:11 PM, Michel Conrad wrote:

Hi,
is it possible to change the number of replicas dynamically? Lets say
I index my data into different indices. After every day I switch to a
new index.
If I know from my usage scenario, that the most recent data is the
data my users search for mostly and the older data is needed less
often, would it be a good idea to change
the replica settings on the fly, for instance start with 2 replicas
during the first week, then down to 1 replica for the second week, and
to no replica at all for data older than a month.

Note, if you have no replicas, and you loose a node that had a shard for the
index with no replicas, you will loose that data.

There are some open questions where I didn't find an answer for:

  • If I have no replica and I lose a node, what will happen to the
    cluster? Will queriing still work (without the lost data)?

Querying will still work on whatever data is available, but, that relevant
data is lost.

  • If I start with 2 replicas and I go down to 1, will the data still
    stay on the hdd on the third node, so that I can recover it in a case
    off an HDD failure?

If you reduce to less replicas, then it won't exist on disk. You will now
have 2 copies of the data instead of 3.

  • In order to speed up queriing, what are the prerequesites to use an
    im memory index? Is it possible to keep a backup of the data on HDD
    and use an inmemory index without replica?

I suggest first to check the file based index, with file system cache, it
will be pretty fast.

Regards,
Michel

On Tue, Apr 19, 2011 at 11:14 AM, Kristian Jörg k...@devo.se wrote:

Hi,

thanks for all information everybody!
I am now getting the grasp on this and feel confident I can manage the
configuration from here on.

/Kristian

Berkay Mollamustafaoglu skrev 2011-04-15 14:24:

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.
As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg k...@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas
in respect to different scenarios. For instance if I use a single node and
end up with 5 shards in it. Would it not be better to use a setup of 1 shard
/ 1 replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index
up to a million documents per index, but there each document is rather
small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,
In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.
Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.
Back to your question. This does not sound like a large set of
documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.
-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with
google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Sure, I plan to start documenting this aspects post 0.16. Regarding how a document is routed to a shard, by default, it will be routed based on the value of its id (hashing it).

You can control the routing value by providing it when indexing. For example, you can make sure that all docs for a specific user will be routed to the same shard by using the username as the routing value. Why one would like to do that? Because when searching, you can specify routing values (which will in turn choose the shards hte search will be executed on), so when you want to only search for a specific user, you only hit one shard.

-shay.banon
On Wednesday, April 20, 2011 at 3:26 AM, Tim Scott wrote:

Good discussion. I'm lacking one baseline concept to get a handle on
this myself. Nowhere have I seen "shard" defined. In general terms I
understand "shard" to be a subdivision of data. For example, an RDBMS
table of users could be sharded across servers based on alphabetical
subdivisions of username. I assume a shard in ES is something like
that, but it would really help me to know specifically what a shard is
to ES and how it decides what goes into which shard?

On Apr 19, 7:38 am, Michel Conrad michel.con...@trendiction.com
wrote:

Thanks for the clarification.

On Tue, Apr 19, 2011 at 2:36 PM, Shay Banon

shay.ba...@elasticsearch.com wrote:

Hi,
No, that index will remain in status red, and there is no way to tell the
cluster that the index will no longer have this shard available. If you do
manage to bring back the machine where that shard is stored, then it will be
recovered.
-shay.banon

On Tuesday, April 19, 2011 at 3:34 PM, Michel Conrad wrote:

Hi Shay,

thank you for your answers. The only thing that is not clear to me is
in case that I have an index
without replicas and I lose a shard, the state will go from green to
red, as yellow cannot happen
when there are no replicas, correct? Is it possible in that case to
tell the index that there will be
one shard less from now one, that the state will be green again. I am
asking because at one point
I am waiting for the cluster state to become yellow, and with an index
on red this will not be possible.

Best,
Michel

On Tue, Apr 19, 2011 at 1:15 PM, Shay Banon
shay.ba...@elasticsearch.com wrote:

Yes, you can change the number of replicas dynamically, check teh update
settings API. More answers below:

On Tuesday, April 19, 2011 at 2:11 PM, Michel Conrad wrote:

Hi,
is it possible to change the number of replicas dynamically? Lets say
I index my data into different indices. After every day I switch to a
new index.
If I know from my usage scenario, that the most recent data is the
data my users search for mostly and the older data is needed less
often, would it be a good idea to change
the replica settings on the fly, for instance start with 2 replicas
during the first week, then down to 1 replica for the second week, and
to no replica at all for data older than a month.

Note, if you have no replicas, and you loose a node that had a shard for the
index with no replicas, you will loose that data.

There are some open questions where I didn't find an answer for:

  • If I have no replica and I lose a node, what will happen to the
    cluster? Will queriing still work (without the lost data)?

Querying will still work on whatever data is available, but, that relevant
data is lost.

  • If I start with 2 replicas and I go down to 1, will the data still
    stay on the hdd on the third node, so that I can recover it in a case
    off an HDD failure?

If you reduce to less replicas, then it won't exist on disk. You will now
have 2 copies of the data instead of 3.

  • In order to speed up queriing, what are the prerequesites to use an
    im memory index? Is it possible to keep a backup of the data on HDD
    and use an inmemory index without replica?

I suggest first to check the file based index, with file system cache, it
will be pretty fast.

Regards,
Michel

On Tue, Apr 19, 2011 at 11:14 AM, Kristian Jörg k...@devo.se wrote:

Hi,

thanks for all information everybody!
I am now getting the grasp on this and feel confident I can manage the
configuration from here on.

/Kristian

Berkay Mollamustafaoglu skrev 2011-04-15 14:24:

Shard/replica configuration depends on your needs. There are many factors,
how many docs, users, number of queries, etc. 50K documents is not much so
you're probably OK with one server unless you have very high number of
users/queries.
Replica is for redundancy but also for improves query performance. If you
need high availability, you should have 2 servers naturally. In this case, 5
shards and 1 replica (note 1 replica means 1 additional copy, total of 2
copies) would be reasonable.
If you will have only 1 server, then 5 shards and no replica is better, you
can always add a replica later, if needed. Adding shards later is not that
easy so even with 1 server, I'd keep number of shards higher. If your memory
is really limited and you don't expect expansion then you can limit the
number of shards to 2.
As you can see, there are a lot of if, then, unless in the statements above.
I'd recommend playing around with it, see how the performance is and adjust
accordingly, rather than trying to figure out everything from the beginning.
It would not be difficult to change your config and re-index 50K docs
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Apr 15, 2011 at 3:49 AM, Kristian Jörg k...@devo.se wrote:

Thanks for the reply.

However I still do not understand how I should configure shards/replicas
in respect to different scenarios. For instance if I use a single node and
end up with 5 shards in it. Would it not be better to use a setup of 1 shard
/ 1 replica? All is on the same node anyways?
Is it better to split the index in many shards and balance those on
several nodes? I guess network protocols will play a role as a limit in that
scenario. And how does replicas play in? In what way does several replicas
affect the situation. Are they for redundance? I.e if a node goes down a
replica of the shard on another node still serves the index? If that is
correct, then what is a good ratio between shards/replicas for a set number
of servers?
Many questions... :slight_smile:

Back to the index I am building. Yes, it is a small index in the Lucene/ES
world. But I expect to add several of the same magnitude to the mix as time
goes.
And I have another application (for library catalogues) which will index
up to a million documents per index, but there each document is rather
small.

/Kristian

Shay Banon skrev 2011-04-15 00:38:

Heya,
In elasticsearch, when you create an index, you define the number of
shards and number of replicas. By default, an index is created with 5 shards
and 1 replica per shard. If you have 2 nodes, and create a single index,
with 5 shards and 1 replica, then each node will have 5 shards. Once you
start adding more nodes, those shards will start to get rebalanced between
those nodes.
Thats the gist of it in a very high level manner. One way you can
understand it possibly better is to simply use the cluster state API, it
gives a nice breakdown of indices, shards, and where they are allocated and
on which nodes. You can easily create indices and start several servers on
your laptop and see how it behaves.
Back to your question. This does not sound like a large set of
documents. If you did not play with changing the number of shards and
replica, you have nice growth path of up to 10 nodes (1 shard per node),
size wise you can grow up to 5 shards (not counting replicas) which is up to
5 servers/nodes.
-shay.banon

On Thursday, April 14, 2011 at 2:53 PM, Kristian Jörg wrote:

Well, I am probably looking a bit dumb asking this. But anyways here it
goes.
The first time ever I came by the names shards and replicas was when I
started looking into new search engines for my web applications and
ended up here with ES. I understand these terms describe how the index
is divided between instances of nodes in a cluster, but I would really
need a deep understanding of the concepts. Especially as we are now
beginning to put ES in production and will actually start using two
nodes as a cluster.

So any good pointers to info on these concepts would be much
appreciated. I tried googling but there was too much noise and nothing
to actually explain the concepts. First time I came up empty with
google...

I am about to put an index of about 50.000 documents (each document an
OCR interpreted page of a book) on a cluster of two servers. What would
be a good setting for shards and replicas for this type of index and
cluster?

Another question on the topic..
If i have 2 nodes and add 5 indexes with each 5 shards and 1 replica. Does this mean a smooth ride to 25 nodes or how is it calculated then?

On May 27, 8:22 am, "Mr.Johnsson" c.johns...@wocodi.com wrote:

Another question on the topic..
If i have 2 nodes and add 5 indexes with each 5 shards and 1 replica. Does
this mean a smooth ride to 25 nodes or how is it calculated then?

Yes, this means you have 25 shards, which can be spread over up to 25
nodes.

Um – arithmetic – doesn’t this mean 50 shards? 25 primary and 25 replica? So – it could actually be spread over up to 50 nodes?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.commailto:Bob.Sandiford@sirsidynix.com
www.sirsidynix.comhttp://www.sirsidynix.com/

Join the conversation: Like us on Facebook!http://www.facebook.com/SirsiDynix Follow us on Twitter!http://twitter.com/SirsiDynix

From: Eric Jain [via Elasticsearch Users] [mailto:ml-node+s115913n4018467h24@n3.nabble.com]
Sent: Tuesday, May 29, 2012 12:17 PM
To: Bob Sandiford
Subject: Re: Shards and replicas

On May 27, 8:22 am, "Mr.Johnsson" <[hidden email]</user/SendEmail.jtp?type=node&node=4018467&i=0>> wrote:

Another question on the topic..
If i have 2 nodes and add 5 indexes with each 5 shards and 1 replica. Does
this mean a smooth ride to 25 nodes or how is it calculated then?

Yes, this means you have 25 shards, which can be spread over up to 25
nodes.


If you reply to this email, your message will be added to the discussion below:
http://elasticsearch-users.115913.n3.nabble.com/Shards-and-replicas-tp2819984p4018467.html
To start a new topic under Elasticsearch Users, email ml-node+s115913n115913h50@n3.nabble.commailto:ml-node+s115913n115913h50@n3.nabble.com
To unsubscribe from Elasticsearch Users, click herehttp://elasticsearch-users.115913.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=115913&code=Ym9iLnNhbmRpZm9yZEBzaXJzaWR5bml4LmNvbXwxMTU5MTN8LTIxMTYxMTI0NTQ=.
NAMLhttp://elasticsearch-users.115913.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html!nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers!nabble%3Aemail.naml-instant_emails!nabble%3Aemail.naml-send_instant_email!nabble%3Aemail.naml

25 shards data capacity wise, because you have 1 replica, then the total
shard count is 50...

On Tue, May 29, 2012 at 5:47 PM, Eric Jain eric.jain@gmail.com wrote:

On May 27, 8:22 am, "Mr.Johnsson" c.johns...@wocodi.com wrote:

Another question on the topic..
If i have 2 nodes and add 5 indexes with each 5 shards and 1 replica.
Does
this mean a smooth ride to 25 nodes or how is it calculated then?

Yes, this means you have 25 shards, which can be spread over up to 25
nodes.