That is quite large. Is this figure based on a reasonably large amount of data, so that you get the full benefit of compression etc.?
This could potentially lead to a lot of small shards. Make sure you reduce the number of primary shards per index if you do this. Given the 45-day retention period, you may also want to consider weekly indices, as a lot of small shards can be inefficient.
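As a rough illustration, a minimal index template that drops the primary shard count could look something like this (the `logs-*` pattern and the counts here are placeholders, and this assumes the legacy `_template` API on a 6.x/7.x cluster):

```
PUT _template/logs_weekly
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```

If you go weekly, pick the shard count so that each individual shard stays in a reasonable size range.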
Yes, replica shards need to be accounted for.
For logging use cases we see varying disk-to-RAM ratios, often a lot higher than 1:24. The ideal ratio for your use case will depend on your data, workload and type of disks, so I would recommend running a benchmark.
As clusters grow, having 3 dedicated master nodes is best practice. Many logging use cases get by without dedicated coordinating nodes, but you may want to have one for each Kibana instance.
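To sketch what a dedicated master looks like in elasticsearch.yml (pre-7.9 style settings):

```
# elasticsearch.yml for a dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false
```

A coordinating-only node would set all three of these to false; on 7.9 and later you would use `node.roles: [ master ]` and `node.roles: [ ]` respectively instead.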
I created a few input files with original events from production, enabled best compression, and wrote grok filters in Logstash so that only the important fields we need end up in the output. In addition, I defined a mapping per log type in Elasticsearch.
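A minimal sketch of that kind of template, with a placeholder index pattern and field names rather than our real ones (assuming a 7.x cluster and the legacy `_template` API):

```
PUT _template/app_logs
{
  "index_patterns": ["app-logs-*"],
  "settings": {
    "index.codec": "best_compression"
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "level":      { "type": "keyword" },
      "message":    { "type": "text" }
    }
  }
}
```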
To accurately draw conclusions about how much space your data will take up on disk, you should index a reasonably large amount of it. I think we recommend around a GB or so in this video about sizing, but a few hundred MB should be sufficient.
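Once the test data is indexed (and ideally force-merged so segment counts don't skew the numbers), you can read the on-disk size straight from the cat indices API; the index names below are just placeholders:

```
POST logs-test-2018.01/_forcemerge?max_num_segments=1

GET _cat/indices/logs-test-*?v&h=index,docs.count,pri.store.size,store.size
```

Dividing `pri.store.size` by the raw size of the input files gives you the expansion (or compression) factor to plug into the sizing estimate.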
As per the link I provided earlier, aim for a shard size between 10GB and 50GB. If shards from daily indices end up too small and shards from weekly indices too large, go weekly and increase the number of primary shards so each shard lands in that range. You can also use the rollover index API to switch to a new index based on data volume and/or age, whichever condition is hit first.
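A sketch of the rollover call, assuming you index through an alias (called `logs-write` here) and want to cut over after 7 days or 50GB, whichever comes first (the `max_size` condition needs Elasticsearch 6.1 or later):

```
POST logs-write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_size": "50gb"
  }
}
```

You would run this on a schedule (for example via Curator's rollover action); when a condition is met, Elasticsearch creates the next index and moves the write alias to it.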
As it looks like you have based the size estimate on a small data set, I would revisit this before determining how much disk space you will need and how many nodes.