I'm sorry if this question has been asked many times before, but the usual response to this type of question seems to be that it depends on what you're planning to do, so I figured I'd ask whether my own calculations are correct.
The plan is to build an ELK stack to gather logging for a number of servers. Looking at the log files currently, we would be at 130GB per day. This is without any buffer for expansion, but our current logs are made very human-readable, so we assume that with an efficient Logstash config we can reduce this number.
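To illustrate why I assume that, here's a made-up example (not our actual log format, and the field names are invented): a lot of the bytes in our current entries go to labels, padding and banners that a Logstash filter would throw away, keeping only the structured fields.

```python
# Invented example: a verbose "human readable" log entry versus the
# compact structured document we'd actually keep after parsing it.
import json

human_readable_entry = """\
=== REQUEST COMPLETED ========================================
    Timestamp ......... : 2024-01-15 10:32:07,123
    Service ........... : order-service
    Result ............ : Request processed successfully
    Customer ID ....... : 48211
    Duration .......... : 231 milliseconds
    HTTP status code .. : 200
==============================================================
"""

structured_doc = {
    "ts": "2024-01-15T10:32:07.123Z",
    "svc": "order-service",
    "customer_id": 48211,
    "duration_ms": 231,
    "status": 200,
}

print(len(human_readable_entry))        # bytes of the entry as logged today
print(len(json.dumps(structured_doc)))  # bytes of the trimmed-down document
```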
The system would be used by a limited number of users, as it's mostly going to be used for performance analysis and troubleshooting of a platform.
The current idea is that we'd like to keep the data for 90 days. That already seems like a lot, and we might reduce it, but let's stick with 90 days for now.
I may be completely wrong about this, but some older information/posts on the internet suggest that I have to multiply my data by 5 to get the actual storage, and that's without taking replicas into account. Other articles (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0) suggest that a factor of 1.4 might already be enough. Which is correct?
Using the factor of 5 in my calculation, I estimate that I need the following config/hardware (the arithmetic behind both lists is sketched below them):
- a minimum of 3 shards (based on a max of 50GB per shard)
- 114TB total cluster storage
- a minimum of 9 data nodes with 2x8TB in RAID 0
Whereas if I multiply by 1.4, it obviously becomes a completely different story:
- a minimum of 3 shards (based on a max of 50GB per shard)
- 32TB total cluster storage
- a minimum of 4 data nodes with 2x8TB in RAID 0
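To make the arithmetic behind both lists explicit, here's the back-of-the-envelope calculation I'm doing. It assumes daily indices and one replica of every primary shard, and converts GB to TB by dividing by 1024; please correct me if the model itself is wrong.

```python
# Rough cluster storage estimate: 130GB/day of raw logs, 90 days retention,
# 1 replica per primary shard, and the two expansion factors from above.
import math

DAILY_RAW_GB = 130     # raw log volume per day
RETENTION_DAYS = 90
REPLICAS = 1           # every primary shard is stored one extra time
MAX_SHARD_GB = 50      # rule-of-thumb ceiling for shard size

for factor in (5.0, 1.4):
    total_gb = DAILY_RAW_GB * RETENTION_DAYS * factor * (1 + REPLICAS)
    print(f"factor {factor}: ~{total_gb / 1024:.0f} TB total cluster storage")

# primary shards per daily index, given the 50GB ceiling
print("primary shards per daily index:", math.ceil(DAILY_RAW_GB / MAX_SHARD_GB))
```

That prints ~114 TB for the factor of 5 and ~32 TB for 1.4, plus 3 primary shards per day, which is where the numbers in the lists above come from.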
Of course a different disk plan could be used, but I figured I'd stick with a 2-disk RAID 0 to keep things simple (since the more disks in a RAID 0 array, the higher the odds of losing it).
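My reasoning there, as a quick sketch with a made-up per-disk failure probability: RAID 0 loses the whole array as soon as any one disk fails, so the odds get worse with every disk added.

```python
# Chance of losing a RAID 0 array, assuming independent disks that each
# fail with probability p over the period we care about. p = 3% is just
# an illustrative number, not a real drive statistic.
p = 0.03
for n in (2, 4, 8):
    print(f"{n} disks: {1 - (1 - p) ** n:.1%} chance of losing the array")
```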
Which calculation is the better way to estimate the needed storage, or is there even more I'm missing here?
For example, how does RAM come into the picture here? Would 64GB be enough?