Hello,
we have a pretty big cluster (26 data nodes, almost 100TB), and I have a question about the disk usage distribution across data nodes.
I know that Elasticsearch primarily balances the number of shards per node rather than disk usage, so when shard sizes vary a lot, it can lead to very big differences in disk usage between nodes.
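For reference, this is the kind of check I run to see the per-node spread (just the `_cat/allocation` API; the column list is only the fields I find useful here):

```
GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent&s=disk.percent:desc
```

In our case the `shards` column is almost identical across nodes, while `disk.percent` varies a lot.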
Hey @porscheme ,
can you be more specific?
I'll try to expand as much as possible, hopefully it will answer your question.
Our cluster runs on EC2 instances of type i3en.2xlarge, installed with the RPM package, with the maximum heap (31GB) configured per data node, and each data node has 5TB of disk space.
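(For completeness, the heap and disk figures can be verified per node with `_cat/nodes`; the column selection below is just an example:)

```
GET _cat/nodes?v&h=name,heap.max,heap.percent,disk.total,disk.used_percent&s=disk.used_percent:desc
```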
First of all, I don't claim to be an expert,
but from my experience (maintaining this cluster for over 3.5 years) I can say that the practical maximum heap size is about 31GB, and since the heap should be roughly half of the machine's RAM (leaving the rest for the filesystem cache), 64GB of RAM per node is about the maximum worth provisioning.
Plus, shard size should also stay close to the heap size, so I believe it's better to stick to around 30-40GB per shard rather than 75GB.
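If your indices are time-based, one way to keep shards in that range is an ILM rollover on primary shard size; here is a minimal sketch (the policy name and `max_age` are placeholders, and 40gb is just the upper end of the range above):

```
# policy name is a placeholder; roll over before primary shards grow past ~40GB
PUT _ilm/policy/my-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "40gb",
            "max_age": "7d"
          }
        }
      }
    }
  }
}
```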
If someone else has other insights/suggestions, I would like to hear them, but this is my opinion.