Document distribution in a cluster


(Drew H) #1

Hello,
I have ~10M documents to index and when I add them to my local
elasticsearch server, they consume about 15G of disk space. My
understanding is the documents will distribute evenly and non-
redundantly in a two node cluster and disk requirements will be ~7.5G
on each node for those same ~10M documents. However, after loading
those documents into a 2 node cluster, I've found that both nodes are
using about 15G of disk. Why? Is that what I should expect to see or
am I getting tripped up by a misconfiguration? Both my local and
clustered environments are configured with 1 replica for the index in
question.

Thanks,
Drew


(Clinton Gormley) #2

Hi Drew

> I have ~10M documents to index and when I add them to my local
> elasticsearch server, they consume about 15G of disk space. My
> understanding is the documents will distribute evenly and non-
> redundantly in a two node cluster and disk requirements will be ~7.5G
> on each node for those same ~10M documents. However, after loading
> those documents into a 2 node cluster, I've found that both nodes are
> using about 15G of disk. Why? Is that what I should expect to see or
> am I getting tripped up by a misconfiguration? Both my local and
> clustered environments are configured with 1 replica for the index in
> question.

If you have 1 replica, that means that you have two copies of each
shard: i.e. one primary and one replica.

So when you start your second node, elasticsearch will bring all of the
replica shards online, which in your situation will take the same
amount of disk again as was consumed by the primaries.

If you were to start a 3rd node, you would see that the total
disk usage across all 3 nodes would still be about 30GB (15GB + 15GB).
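The arithmetic can be sketched as a toy model (illustrative only, using the shard counts and the ~7.5GB-per-shard figure from this thread; real shard sizes vary):

```python
def total_disk_gb(primaries, replicas, shard_gb, nodes):
    """Total disk used across the cluster. A replica only becomes
    live when there is a node available that does not already hold
    a copy of that shard, so the number of live copies per shard is
    capped by the node count."""
    live_copies = min(1 + replicas, nodes)
    return primaries * live_copies * shard_gb

# One node: only the 2 primaries are live.
print(total_disk_gb(primaries=2, replicas=1, shard_gb=7.5, nodes=1))  # 15.0
# Two nodes: the replicas come online, doubling total disk use.
print(total_disk_gb(primaries=2, replicas=1, shard_gb=7.5, nodes=2))  # 30.0
# A third node changes the layout, not the total.
print(total_disk_gb(primaries=2, replicas=1, shard_gb=7.5, nodes=3))  # 30.0
```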

clint


(Berkay Mollamustafaoglu-2) #3

By default you have 1 replica for each index, so both nodes in your
cluster may have all the data unless you have explicitly set the
number of replicas to 0.
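In concrete terms, the update Berkay describes looks like this (a sketch; the index name "myindex" is a placeholder, but the `_settings` endpoint and the `number_of_replicas` setting are the real Elasticsearch API):

```python
import json

# Settings body that turns off replication for an existing index:
settings = {"index": {"number_of_replicas": 0}}

# The equivalent HTTP request against a local node would be roughly:
#   curl -XPUT 'http://localhost:9200/myindex/_settings' \
#        -d '{"index": {"number_of_replicas": 0}}'
print(json.dumps(settings))
```

With replicas disabled, a two-node cluster holds exactly one copy of each shard, so the ~15G of data is split across the nodes instead of duplicated.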

Berkay

On Wednesday, June 15, 2011, Drew H hite.drew@gmail.com wrote:

> Hello,
> I have ~10M documents to index and when I add them to my local
> elasticsearch server, they consume about 15G of disk space. My
> understanding is the documents will distribute evenly and non-
> redundantly in a two node cluster and disk requirements will be ~7.5G
> on each node for those same ~10M documents. However, after loading
> those documents into a 2 node cluster, I've found that both nodes are
> using about 15G of disk. Why? Is that what I should expect to see or
> am I getting tripped up by a misconfiguration? Both my local and
> clustered environments are configured with 1 replica for the index in
> question.
>
> Thanks,
> Drew

--
Regards,
Berkay Mollamustafaoglu
Ph: +1 (571) 766-6292
mberkay on yahoo, google and skype


(Drew H) #4

Thanks for the replies so far, but I'm afraid I'm still
confused. To rephrase my question more concisely: assuming the same
number of replicas and the same documents, should a multi-server ES
config consume the same amount of disk as a single-server config? If
so, why am I not seeing that in the scenario I mentioned earlier (my
two-node config is consuming 30G while my single-node config uses 15G
with the same documents indexed)? And if a two-node config should
consume twice the disk of a single-node config for the same set of
documents, what are the benefits of such a configuration beyond fault
tolerance?

Thanks,
Drew

On Jun 15, 11:41 pm, Clinton Gormley clin...@iannounce.co.uk wrote:

> Hi Drew
>
> > I have ~10M documents to index and when I add them to my local
> > elasticsearch server, they consume about 15G of disk space. My
> > understanding is the documents will distribute evenly and non-
> > redundantly in a two node cluster and disk requirements will be ~7.5G
> > on each node for those same ~10M documents. However, after loading
> > those documents into a 2 node cluster, I've found that both nodes are
> > using about 15G of disk. Why? Is that what I should expect to see or
> > am I getting tripped up by a misconfiguration? Both my local and
> > clustered environments are configured with 1 replica for the index in
> > question.
>
> If you have 1 replica, that means that you have two copies of each
> shard: i.e. one primary and one replica.
>
> So when you start your second node, elasticsearch will bring all of the
> replica shards online, which in your situation will take the same
> amount of disk again as was consumed by the primaries.
>
> If you were to start a 3rd node, you would see that the total
> disk usage across all 3 nodes would still be about 30GB (15GB + 15GB).
>
> clint


(Clinton Gormley) #5

On Wed, 2011-06-15 at 21:20 -0700, Drew H wrote:

> Thanks for the replies so far, but I'm afraid I'm still
> confused. To rephrase my question more concisely: assuming the same
> number of replicas and the same documents, should a multi-server ES
> config consume the same amount of disk as a single-server config? If
> so, why am I not seeing that in the scenario I mentioned earlier (my
> two-node config is consuming 30G while my single-node config uses 15G
> with the same documents indexed)? And if a two-node config should
> consume twice the disk of a single-node config for the same set of
> documents, what are the benefits of such a configuration beyond fault
> tolerance?

  • You start one node.

  • You create one index with 2 primary shards and 1 replica (of each
    primary shard, ie 4 shards in total).

  • Because you have only 1 node, only the 2 primary shards are live. No
    replicas.

  • Your primary shards take up 15GB (7.5GB each)

  • You start a second node

  • ElasticSearch can now put the replicas live

  • You now have 2 shards on each node. Each shard takes up 7.5GB, so
    you have 15GB on each node, 4 x 7.5GB = 30GB in total

  • You start a third and fourth node

  • ElasticSearch can now relocate the shards so that each of the
    4 shards is on a separate node.

  • Each shard still takes up 7.5GB.

  • Total disk use is still 7.5GB x 4 shards = 30GB
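The placement in the walkthrough above can be sketched with a toy round-robin allocator (illustrative only; real Elasticsearch shard allocation is more involved, but the per-node totals come out the same):

```python
def assign_shards(primaries, replicas, nodes):
    """Toy round-robin placement of shard copies onto nodes. A copy
    is skipped when there are more copies of a shard than nodes,
    which models the 'replica stays unassigned on a single node'
    behaviour described above."""
    layout = {n: [] for n in range(nodes)}
    slot = 0
    for shard in range(primaries):
        for copy in range(1 + replicas):
            if copy >= nodes:
                continue  # no node left to host this copy
            layout[slot % nodes].append((shard, copy))
            slot += 1
    return layout

# 2 primaries, 1 replica, 4 nodes: one 7.5GB shard copy per node.
for node, copies in assign_shards(2, 1, 4).items():
    print(f"node {node}: {copies} -> {len(copies) * 7.5}GB")
```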

As to your question of why it does it this way, consider this:

  • you have 2 primary shards on one node only, zero replicas
  • you start another node
  • ES moves one primary onto the other node
  • node 1 dies -> you've lost half your data

Without duplicating the data, it has no way of giving you failover.

clint

