About scalability in data volume

Liang_Yu_Chou · October 25, 2012, 7:05am

For each cluster, I know that I can scale out query capacity by adding
replica nodes.
But is it possible that I can scale out in data volume?

As I know all data are written into one of the primary shards first, then
copied into replicas.
However the number of primary shards has to be defined at the beginning.
Doesn't that
pose a limitation on the max number of instances(for primary shards) in the
cluster?

The default number of primary shard is 5. For future scalability, is there
any drawback if
I set it to a big number?

--

radu_gheorghe · October 25, 2012, 9:46am

Hello Jerry,

The problem with having many shards is that each shard has an
overhead. So while it's OK to over-shard in order to scale out, if you
exaggerate things will get slow, especially on the query side.

Luckily, there are other solutions than starting off with a huge
number of shards. It depends on your data to choose what fits best.
For example, you can add more indices as you go along.

I suggest you take a look at this video, as it's exactly about this topic:

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Thu, Oct 25, 2012 at 10:05 AM, Jerry Chou fishlet0528@gmail.com wrote:

For each cluster, I know that I can scale out query capacity by adding
replica nodes.
But is it possible that I can scale out in data volume?

As I know all data are written into one of the primary shards first, then
copied into replicas.
However the number of primary shards has to be defined at the beginning.
Doesn't that
pose a limitation on the max number of instances(for primary shards) in the
cluster?

The default number of primary shard is 5. For future scalability, is there
any drawback if
I set it to a big number?

--

dadoonet · October 28, 2012, 5:56pm

I add to Radu's answer that if you have many shards on a single node, you can hit the "too many open files" issue. Each shard is a full Lucene instance.

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 25 oct. 2012 à 11:46, Radu Gheorghe radu.gheorghe@sematext.com a écrit :

Hello Jerry,

The problem with having many shards is that each shard has an
overhead. So while it's OK to over-shard in order to scale out, if you
exaggerate things will get slow, especially on the query side.

Luckily, there are other solutions than starting off with a huge
number of shards. It depends on your data to choose what fits best.
For example, you can add more indices as you go along.

I suggest you take a look at this video, as it's exactly about this topic:

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Thu, Oct 25, 2012 at 10:05 AM, Jerry Chou fishlet0528@gmail.com wrote:

For each cluster, I know that I can scale out query capacity by adding
replica nodes.
But is it possible that I can scale out in data volume?

As I know all data are written into one of the primary shards first, then
copied into replicas.
However the number of primary shards has to be defined at the beginning.
Doesn't that
pose a limitation on the max number of instances(for primary shards) in the
cluster?

The default number of primary shard is 5. For future scalability, is there
any drawback if
I set it to a big number?

--

BillyEm · October 29, 2012, 2:46am

Radu: It'd be nice if "overhead" were defined. Since resources are almost
always in a cost-benefit relationship in a datacenter.

Jerry: as to sharding out before you need it. Take a look at routing, and
installing your own hash algorithm. Remember Computer 101, about why there
isn''t just 1 implementation of hash. You could use the two to distribute
your data onto only the resources benefiting you. The PLAN to scale out via
the same algorithmic relationships would have to be done as part of phase

Complexity changes orders of mag ... right if you don't. . As always,
knowing your content, if you can, (sometimes not possible in the real world
folks) will make a world of of difference.

g'luck.

On Thursday, October 25, 2012 3:05:51 AM UTC-4, Jerry Chou wrote:

For each cluster, I know that I can scale out query capacity by adding
replica nodes.
But is it possible that I can scale out in data volume?

As I know all data are written into one of the primary shards first, then
copied into replicas.
However the number of primary shards has to be defined at the beginning.
Doesn't that
pose a limitation on the max number of instances(for primary shards) in
the cluster?

The default number of primary shard is 5. For future scalability, is there
any drawback if
I set it to a big number?

--

Topic		Replies	Views
Considering scalability , is it right to keep a large number of primary shards at beginning？ Elasticsearch	5	421	July 6, 2017
Few queries on setting up a high performing and scalable ES setup Elasticsearch	3	353	July 6, 2017
Figuring out the optimal number of shards Elasticsearch	6	1653	July 6, 2017
Scaling ES Cluster and balacing shards (primary, replica) Elasticsearch	1	619	July 5, 2017
Dynamic growing, a solution for a fixed shard number? Elasticsearch	2	1820	July 6, 2017

About scalability in data volume

Best regards, Radu

Best regards, Radu

Related topics

Best regards,
Radu

Best regards,
Radu