About scalability in data volume

For each cluster, I know that I can scale out query capacity by adding
replica nodes.
But is it also possible to scale out in data volume?

As far as I know, all data is first written to one of the primary shards
and then copied to the replicas.
However, the number of primary shards has to be defined at the beginning.
Doesn't that limit the maximum number of instances (for primary shards)
in the cluster?

The default number of primary shards is 5. For future scalability, is
there any drawback if I set it to a large number?

--

Hello Jerry,

The problem with having many shards is that each shard has an
overhead. So while it's OK to over-shard in order to scale out, if you
overdo it, things will get slow, especially on the query side.
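
To make "over-sharding" concrete: the number of primary shards is part of the index settings sent when the index is created. The sketch below only builds that request body. `number_of_shards` and `number_of_replicas` are the standard index settings; the helper name and the numbers (8 primaries on what is assumed to be a 2-node cluster) are invented for illustration.

```python
import json


def index_settings(primary_shards, replicas=1):
    """Build the JSON body for creating an index (hypothetical helper)."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return {
        "settings": {
            "index": {
                # Chosen at creation time; the question is about this value.
                "number_of_shards": primary_shards,
                # Copies of each primary; this one can be changed later.
                "number_of_replicas": replicas,
            }
        }
    }


# Assumed scenario: 2 nodes today, 8 primaries so the index can later
# spread over up to 8 nodes without being recreated.
body = index_settings(primary_shards=8, replicas=1)
print(json.dumps(body, indent=2))
```

The body would be sent with an HTTP PUT to the index URL, for example `http://localhost:9200/myindex` (host and index name are placeholders).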

Luckily, there are other solutions than starting off with a huge
number of shards. Which one fits best depends on your data.
For example, you can add more indices as you go along.
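
One way to read "add more indices as you go along", assuming the data is time-based (the thread does not say so), is to write each day's documents to its own index and search several of them at once. The `logs` prefix and the daily granularity below are assumptions made for the example.

```python
from datetime import date, timedelta


def daily_index_name(prefix, day):
    """Name of the index holding one day's documents, e.g. logs-2012.10.25."""
    return "{}-{}".format(prefix, day.strftime("%Y.%m.%d"))


def indices_for_range(prefix, first_day, last_day):
    """All daily index names between two dates, inclusive."""
    if last_day < first_day:
        raise ValueError("last_day must not be before first_day")
    span = (last_day - first_day).days
    return [
        daily_index_name(prefix, first_day + timedelta(days=offset))
        for offset in range(span + 1)
    ]


names = indices_for_range("logs", date(2012, 10, 24), date(2012, 10, 26))
# A search request can target several indices as a comma-separated list.
print(",".join(names))  # logs-2012.10.24,logs-2012.10.25,logs-2012.10.26
```

Each new index brings its own primary shards, so total capacity grows with the data instead of being fixed up front.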

I suggest you take a look at this video, as it's exactly about this topic:

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Thu, Oct 25, 2012 at 10:05 AM, Jerry Chou fishlet0528@gmail.com wrote:

--

To add to Radu's answer: if you have many shards on a single node, you can hit the "too many open files" issue. Each shard is a full Lucene instance, which keeps its own files open.
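
A back-of-the-envelope sketch of why this happens. It assumes shards are spread evenly over the nodes; the figure of 40 open files per shard is purely illustrative, not a measured value, since the real number depends on segment count and settings.

```python
import math


def shards_per_node(indices, primaries_per_index, replicas, nodes):
    """Shard copies each node holds, assuming an even spread across nodes."""
    total_copies = indices * primaries_per_index * (1 + replicas)
    return math.ceil(total_copies / nodes)


def estimated_open_files(shards_on_node, files_per_shard):
    """Rough file-handle estimate; files_per_shard is a guess, not a constant."""
    return shards_on_node * files_per_shard


# Assumed cluster: 30 indices with 50 primaries each, 1 replica, 3 nodes.
per_node = shards_per_node(indices=30, primaries_per_index=50, replicas=1, nodes=3)
print(per_node)                                             # 1000
print(estimated_open_files(per_node, files_per_shard=40))   # 40000
```

Comparing such an estimate with the limit reported by `ulimit -n` for the user running the node gives an idea of how close the node is to the error.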

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On Oct 25, 2012, at 11:46, Radu Gheorghe radu.gheorghe@sematext.com wrote:

--

Radu: It'd be nice if "overhead" were defined, since resources in a
datacenter are almost always in a cost-benefit relationship.

Jerry: as for sharding out before you need it, take a look at routing and
at installing your own hash algorithm. Remember Computer 101 and why there
isn't just one implementation of hash. You could use the two to distribute
your data onto only the resources that benefit you. The PLAN to scale out
via the same algorithmic relationships would have to be done as part of
phase 1. Complexity changes by orders of magnitude if you don't. As always,
knowing your content, if you can (sometimes not possible in the real world,
folks), will make a world of difference.
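
The routing idea can be sketched as follows. A document lands on a shard by hashing its routing value (the document id unless a custom routing value is given) and taking the result modulo the number of primary shards. That is also why the number of primaries is fixed at index creation. The hash here is `zlib.crc32`, used only as a stand-in; it is not the function Elasticsearch uses, and the user names are made up.

```python
import zlib


def shard_for(routing_value, primary_shards):
    """Pick a shard as hash(routing value) modulo the primary shard count."""
    # crc32 is a stand-in hash for illustration only.
    return zlib.crc32(routing_value.encode("utf-8")) % primary_shards


users = ["alice", "bob", "carol", "dave"]

# Routing by user: all of one user's documents land on the same shard,
# so a search for that user only needs to visit that shard.
with_5 = {user: shard_for(user, 5) for user in users}

# The same users if the index had 6 primaries instead of 5.
with_6 = {user: shard_for(user, 6) for user in users}

moved = [user for user in users if with_5[user] != with_6[user]]
print("5 primaries:", with_5)
print("6 primaries:", with_6)
print("would land on a different shard:", moved)
```

Any user listed as moved would no longer be found where it was indexed, which is why the shard count and routing scheme have to be planned in phase 1.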

g'luck.

On Thursday, October 25, 2012 3:05:51 AM UTC-4, Jerry Chou wrote:

--