Elasticsearch sizing calculation


(Mark van Rijbroek) #1

I'm sorry if this question has been asked many times before, but the usual response to these types of questions seems to be that it depends on what you're planning to do, so I figured I'd ask whether my own calculations are correct as well.

The plan is to build an ELK stack to gather logs from a number of servers. Looking at the current log files, we would be at 130GB per day. This is without a buffer for expansion, but our current logs are written to be very human readable, and we assume that with an efficient Logstash config we can reduce this number.

The system would be used by a limited number of users, as it's mostly going to be used for performance analysis and troubleshooting of a platform.

The current idea is that we'd like to keep the data for 90 days. This already seems like a lot, and we might reduce it, but let's stick with 90 days for now.

I may be completely wrong about this, but some older information/posts on the internet suggest that I have to multiply my data by 5 to get the actual storage, and that's without taking replicas into account. But other articles (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0) suggest that a factor of 1.4 might already be enough. Which is correct?

Using the multiplication by 5 in my calculation, I would estimate that I need the following config/hardware:

  • a minimum of 3 shards (based on max 50GB per shard)
  • 114TB total cluster storage
  • a minimum of 9 data nodes with 2x8TB in RAID 0

While if I multiply by 1.4, it obviously becomes a completely different story:

  • a minimum of 3 shards (based on max 50GB per shard)
  • 32TB total cluster storage
  • a minimum of 4 data nodes with 2x8TB in RAID 0
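
For anyone wanting to check the arithmetic behind the two storage totals, here is a minimal sketch. It assumes one replica per primary and 1024GB per TB; neither is stated in the post, but together they reproduce the 114TB and 32TB figures. The function name and parameters are hypothetical.

```python
def cluster_storage_tb(daily_gb, retention_days, expansion_factor,
                       replicas=1, gb_per_tb=1024):
    """Estimate total cluster storage in TB.

    Assumptions (not stated in the post): one replica per primary
    and 1024 GB per TB.
    """
    raw_gb = daily_gb * retention_days        # raw log volume kept on hand
    primary_gb = raw_gb * expansion_factor    # on-disk size of primary shards
    total_gb = primary_gb * (1 + replicas)    # primaries plus replica copies
    return total_gb / gb_per_tb


# 130GB/day kept for 90 days, with the two multipliers under discussion
for factor in (5, 1.4):
    print(f"x{factor}: {cluster_storage_tb(130, 90, factor):.1f} TB")
```

The estimate scales linearly with the multiplier, which is why the choice between 5 and 1.4 dominates everything else in the plan.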

Of course a different disk layout could be used, but I figured I'd stick with a 2-disk RAID 0 to keep things simple (the more disks in a RAID 0 array, the higher the odds of it failing).

Which calculation would be the best way to estimate the needed storage, or is there more that I'm missing here?
For example, how does RAM come into the picture? Would 64GB be enough?


(Mark Walkom) #2

The factor of 5 sounds super old and/or entirely wrong. Where does it say this?
I'd suggest our blog is probably a better guide.


(Christian Dahlqvist) #3

The blog post highlights the impact the various mapping options can have on the index size compared to the raw data. As your data will be different from the sample data set used, your ratio is also likely to differ. The best way to determine this is to optimise the mappings based on your needs and then index a good amount of data (using very small data sets and small shards results in less efficient compression).
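
Once a representative sample has been indexed, the ratio can be computed from the primary store size. The sketch below assumes you have the JSON body returned by the index stats API (`GET /<index>/_stats/store`) as a Python dict and know the raw size of the files you indexed; the function name and the sample numbers are hypothetical.

```python
def index_to_raw_ratio(stats_response, raw_bytes):
    """Ratio of on-disk primary index size to raw input size.

    stats_response: parsed JSON from the index stats API; only the
    _all.primaries.store.size_in_bytes field is read.
    raw_bytes: size of the raw log data that was indexed (measured by you).
    """
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    primary_bytes = stats_response["_all"]["primaries"]["store"]["size_in_bytes"]
    return primary_bytes / raw_bytes


# Hypothetical numbers for illustration only, not a measurement
sample_stats = {"_all": {"primaries": {"store": {"size_in_bytes": 7_000_000_000}}}}
ratio = index_to_raw_ratio(sample_stats, raw_bytes=5_000_000_000)
print(f"index/raw ratio: {ratio:.2f}")
```

Using primaries only keeps replicas out of the ratio, so they can be added back separately when sizing the cluster.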

If you have hardware available, I would also recommend benchmarking your cluster to determine how much data you can handle per node with acceptable query and indexing performance.

