Hello,
I have ~10M documents to index and when I add them to my local
elasticsearch server, they consume about 15G of disk space. My
understanding is the documents will distribute evenly and non-
redundantly in a two node cluster and disk requirements will be ~7.5G
on each node for those same ~10M documents. However, after loading
those documents into a 2 node cluster, I've found that both nodes are
using about 15G of disk. Why? Is that what I should expect to see or
am I getting tripped up by a misconfiguration? Both my local and
clustered environments are configured with 1 replica for the index in
question.
If you have 1 replica, that means you have two copies of each shard: one
primary and one replica.
So when you start your second node, Elasticsearch will allocate all of the
replica shards, which in your situation will take double the disk space
that was consumed by the primaries.
If you were to start a 3rd node, you would see that the total disk
usage across all 3 nodes would still be about 30GB (15GB of primaries +
15GB of replicas).
By default you have 1 replica for each index, so both nodes in your
cluster may hold all the data unless you have explicitly set the number
of replicas to 0.
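To make the arithmetic concrete, here is a minimal Python sketch. The 15G figure and the 1-replica setting come from this thread; the function itself is only an illustration of the copy-counting logic, not real Elasticsearch code:

```python
def cluster_disk(primary_gb, replicas, nodes):
    """Total and (evenly balanced) per-node disk for one index.

    primary_gb: combined size of all primary shards
    replicas:   number_of_replicas setting (extra copies per primary)
    nodes:      nodes available; a replica is only allocated when it can
                live on a different node than its primary.
    """
    copies = 1 + min(replicas, nodes - 1)  # unassignable replicas don't count
    total = primary_gb * copies
    return total, total / nodes

# One node, 1 replica: replicas stay unassigned, so 15G total
print(cluster_disk(15, 1, 1))   # (15, 15.0)
# Two nodes, 1 replica: each node holds a full copy
print(cluster_disk(15, 1, 2))   # (30, 15.0)
# Two nodes, 0 replicas: the data splits across the nodes
print(cluster_disk(15, 0, 2))   # (15, 7.5)
```

The last line shows the ~7.5G-per-node outcome you were expecting: it only happens once replicas are explicitly set to 0.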
--
Regards,
Berkay Mollamustafaoglu
Thanks for the replies so far, but I'm afraid I'm still confused. To
rephrase my question more concisely: assuming the same number of
replicas and the same documents, should a multi-server Elasticsearch
config consume the same total disk as a single-server config? If so,
why am I not seeing that in the scenario I mentioned earlier (my
two-node config is consuming 30G while my single-node config uses 15G
with the same documents indexed)? And if a two-node config should
consume twice the disk of a single-node config for the same set of
documents, what are the benefits of such a configuration beyond fault
tolerance?
You start one node.
You create one index with 2 primary shards and 1 replica of each
primary shard, i.e. 4 shards in total.
Because you have only 1 node, only the 2 primary shards are live. No
replicas.
Your primary shards take up 15GB (7.5GB each)
You start a second node
Elasticsearch can now put the replicas live
You now have 2 shards on each node. Each shard takes up 7.5GB, so
you have 15GB on each node, 4 x 7.5GB = 30GB in total
You start a third and fourth node
Elasticsearch can now relocate the shards so that each of the
4 shards is on a separate node.
Each shard still takes up 7.5GB.
Total disk use is still 7.5GB x 4 shards = 30GB
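The steps above can be sketched with a toy allocator. This is not how Elasticsearch's real allocation logic works internally; it only models the one rule that matters here, namely that a replica is never placed on a node that already holds a copy of the same shard:

```python
def per_node_disk(shard_gb, primaries, replicas, nodes):
    """Toy allocator: place primaries round-robin, then put each replica
    on the least-loaded node that does not already hold a copy of that
    shard. Unassignable replicas are simply skipped.
    Returns the GB held by each node."""
    held = [set() for _ in range(nodes)]       # shard ids on each node
    for p in range(primaries):                 # primaries round-robin
        held[p % nodes].add(p)
    for p in range(primaries):                 # then the replicas
        for _ in range(replicas):
            candidates = [n for n in range(nodes) if p not in held[n]]
            if candidates:
                target = min(candidates, key=lambda n: len(held[n]))
                held[target].add(p)
    return [len(s) * shard_gb for s in held]

print(per_node_disk(7.5, 2, 1, 1))  # [15.0]               primaries only
print(per_node_disk(7.5, 2, 1, 2))  # [15.0, 15.0]         replicas go live
print(per_node_disk(7.5, 2, 1, 4))  # [7.5, 7.5, 7.5, 7.5] one shard per node
```

The three calls reproduce the three stages described above: 15GB on one node, 15GB on each of two nodes (30GB total), and 7.5GB on each of four nodes (still 30GB total).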
As to your question of why it works this way, consider this:
you have 2 primary shards on one node only, zero replicas
you start another node
ES moves one primary onto the other node
node 1 dies -> you've lost half your data
Without duplicating the data, Elasticsearch has no way of giving you
failover.
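The failover point can be sketched the same way: kill a node and see which shards survive. Again this is a toy model of the trade-off, not real Elasticsearch behaviour:

```python
def surviving_shards(nodes_holding, dead_node):
    """nodes_holding: list of sets, the shard ids held by each node.
    Returns the shard ids still available after dead_node fails."""
    alive = set()
    for n, shards in enumerate(nodes_holding):
        if n != dead_node:
            alive |= shards
    return alive

# Zero replicas, one primary per node: losing node 0 loses shard 0
print(surviving_shards([{0}, {1}], dead_node=0))        # {1}
# One replica, each node holds a full copy: no data lost
print(surviving_shards([{0, 1}, {0, 1}], dead_node=0))  # {0, 1}
```

The doubled disk usage is exactly the price of the second case: with 1 replica, either node can die and every shard is still available somewhere.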