Shards / replicas for 14 nodes setup

Hi, Ill try again, reformulating my question a bit since I didnt get a
reply on my last try :slight_smile:

We have java application with embedded elasticsearch in a cluster of 14
nodes. All the data resides in a central database, and they are indexed in
elasticsearch for querying. A full reindex can be done at any time.

The system are very query-heavy, the amount of writes are small. The number
of documents will not be higher than, say, 300.000.
The size of each document varies greatly, from just a couple of ids, to
extracted text from e.g word-documents of several pages.

I want to make sure that in case of a total breakdown, it should be
sufficient that one or two nodes are available for the system to work.

Write consistency should not be a problem since the master copy of the data
is in the database, and it seems that ES is capable of resolving
conflicting data by using the newest version (which should be all right in
our case)

My first though is to use 1 shard, and 13 replicas. This will naturally
ensure that all nodes have access to all data. This could also be
accomplished by having 2 shards / 13 replicas, so this yield that to ensure
that all data is available, the number of replicas should be the number of
nodes - 1, not depending on the number of shards (which could be anything).

If the requirement of number of nodes are reduced to "2 nodes should be up
at any time", then a shards / replica distribution of "x/number of nodes -
2" should be sufficient.

So, for the questions:

  1. Are my thoughts about the write consistency not being a problem correct,
    implying that ES will be able to resolve conflicts by using the newest
    version?
  2. Are my thoughts about the availability and number of replicas correct?
  3. Given the nature of the application and the number of documents, are
    there any recommendation regarding the number of shards? Is it ok to run
    with 1 shard / 13 replicas?

best regards

mvh

Runar Myklebust

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hello Runar,

On Wed, Jun 19, 2013 at 11:53 AM, Runar Myklebust runar.a.m@gmail.comwrote:

Hi, Ill try again, reformulating my question a bit since I didnt get a
reply on my last try :slight_smile:

We have java application with embedded elasticsearch in a cluster of 14
nodes. All the data resides in a central database, and they are indexed in
elasticsearch for querying. A full reindex can be done at any time.

The system are very query-heavy, the amount of writes are small. The
number of documents will not be higher than, say, 300.000.
The size of each document varies greatly, from just a couple of ids, to
extracted text from e.g word-documents of several pages.

I want to make sure that in case of a total breakdown, it should be
sufficient that one or two nodes are available for the system to work.

Write consistency should not be a problem since the master copy of the
data is in the database, and it seems that ES is capable of resolving
conflicting data by using the newest version (which should be all right in
our case)

My first though is to use 1 shard, and 13 replicas. This will naturally
ensure that all nodes have access to all data. This could also be
accomplished by having 2 shards / 13 replicas, so this yield that to ensure
that all data is available, the number of replicas should be the number of
nodes - 1, not depending on the number of shards (which could be anything).

If the requirement of number of nodes are reduced to "2 nodes should be up
at any time", then a shards / replica distribution of "x/number of nodes -
2" should be sufficient.

So, for the questions:

  1. Are my thoughts about the write consistency not being a problem
    correct, implying that ES will be able to resolve conflicts by using the
    newest version?

I guess they are. If you try to index a document with an different version
that what's expected, you'll get an error.

Also, unless you enable async replication, the indexing operation will
finish only after all the replicas have finished indexing the document -
which is good for consistency across replicas.

  1. Are my thoughts about the availability and number of replicas correct?

Yep. In short, if you have X replicas, any X nodes can go down and you
should still be good. If more than X nodes go down, it depends on which of
those nodes go down.

  1. Given the nature of the application and the number of documents, are
    there any recommendation regarding the number of shards? Is it ok to run
    with 1 shard / 13 replicas?

You can run with 1 shard and 13 replicas. 13 replicas should give you the
best concurrency. Not sure if 1 shard is best, though. As far as I've
tested, I didn't see any difference between 1 shard and the default 5
shards on a single node - but I guess it depends on the node's hardware as
well :slight_smile:

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.