I would like a rough estimate of how much disk space is required for a cluster. I found a paper that includes the following equation for calculating disk space:
I have been trying to find Elastic documentation on estimating disk space, but cannot find any. The paper does not mention whether the equation was made up or taken from another source.
It's not completely inaccurate, but it is certainly an oversimplification. The 0.85 in the denominator implies that the size of each shard on disk is about 117% (1/0.85) of the input size. In practice the ratio of size on disk to input size can vary greatly depending on your configuration. Here is an old blog post showing various configurations with ratios between 61% and 140%:
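For a rough back-of-the-envelope check, a small sketch like the one below may help. The raw data size, replica count, and expansion ratios here are illustrative assumptions, not measured values; the only reliable way to get your own ratio is to index a representative sample of your data and measure it.

```python
# Rough disk-space estimate for an Elasticsearch index.
# All numbers below are illustrative assumptions; measure your own
# expansion ratio by indexing a representative sample of your data.

def estimated_disk_gb(raw_gb, expansion_ratio, replicas):
    """Estimate on-disk size: raw data, times the measured on-disk/raw
    expansion ratio, times (primary + replica) copies."""
    return raw_gb * expansion_ratio * (1 + replicas)

raw_gb = 500    # raw (source) data size in GB -- assumption
replicas = 1    # one replica per primary shard -- assumption

# The blog post linked above saw on-disk sizes between roughly 61% and
# 140% of the raw input, depending on mappings and codec settings.
for ratio in (0.61, 1.0, 1.17, 1.40):
    est = estimated_disk_gb(raw_gb, ratio, replicas)
    print(f"expansion {ratio:.2f}: ~{est:,.0f} GB on disk")
```

Running this with your own measured ratio (and your actual replica count) gives a more honest range than a single fixed factor like 1/0.85.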
As David points out, this seems to be a simplification. I do, however, wonder whether the 0.85 factor is meant to account for the fact that the disk watermarks will prevent you from using the full disk capacity and that you need some headroom. This blog post also provides some information, as does this one.
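If that is the intent, the number lines up with Elasticsearch's default low disk watermark of 85%: to keep a node below that threshold, the provisioned disk needs to be at least the on-disk data size divided by 0.85. As a hypothetical example, 1,000 GB of shard data would call for roughly 1,000 / 0.85 ≈ 1,176 GB of disk on the nodes holding it.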