TLDR: what specs are appropriate for client, data, and master nodes when ingesting 250GB/day?
Hi All,
I've been tasked with building out an ELK Stack as my company would like to move away from Splunk. We've already begun using the ELK Stack template that AWS provides, but we would like more control over the configuration. With that being said, I've been reading a lot of documents and I think I have a good idea of the specs for each node, but I still wanted to reach out to the community in case someone had a more definitive answer. This clustered ELK environment would be ingesting around 250GB/day, possibly growing in the near future. This is what I was thinking:
11 total nodes
2 client nodes
3 data nodes
3 master nodes
1 Kibana
2 Logstash
From my reading I have learned that the master nodes don't require much RAM or disk, so I figure maybe 8GB of RAM and 50GB of disk?
From my reading I have learned that data nodes work best with 64GB of RAM but still work well at 32GB. I was thinking maybe 2TB of disk for each data node, with at least 1 year of retention on the indexes.
I am not at all sure what the specs should be for the client nodes, though.
As each shard holds a finite amount of data and comes with some overhead in terms of memory, file handles and CPU, the more heap you have on a node, the more data it can hold. As we generally recommend keeping heap below 32GB and using at most 50% of available RAM for heap, 64GB of RAM per node is often considered the sweet spot.
As dedicated master nodes do not serve traffic and just manage the cluster, they generally only need a few CPU cores and 4-8GB of RAM. Since they do not hold data, heap can be set to 75% of the available host memory. Client nodes may be useful, but are generally not necessary for a lot of logging use cases.
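To make those heap rules concrete, here is a minimal sketch of the sizing logic, assuming the 50%-of-RAM rule for data nodes, the 75% rule for dedicated masters, and a 31GB cap as a stand-in for the compressed-oops ceiling (the exact cutoff depends on your JVM):

```python
def recommended_heap_gb(host_ram_gb, dedicated_master=False):
    """Rough heap suggestion based on the rules of thumb above.

    Data (and client) nodes: ~50% of host RAM, capped just under 32GB so
    the JVM keeps using compressed object pointers; the rest is left to
    the filesystem cache. Dedicated masters: ~75% of host RAM, since they
    hold no data. The 31GB cap is an assumption, not an exact JVM limit.
    """
    fraction = 0.75 if dedicated_master else 0.50
    return min(host_ram_gb * fraction, 31)

print(recommended_heap_gb(64))                        # 31 -> the 64GB "sweet spot"
print(recommended_heap_gb(8, dedicated_master=True))  # 6.0 -> small dedicated master
```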
If you have 250GB per day and want to keep that for 1 year, that corresponds to around 90TB of raw data. Based on that I would expect you to need more disk space on the data nodes as well as a larger number of data nodes. Exactly how much space that amount of data will take up on disk once indexed will largely depend on how you optimise your mappings. Although it is getting a bit old, this blog post illustrates the effect different mappings can have.
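To put rough numbers on that, here is a back-of-the-envelope sketch; the raw-to-indexed disk ratio, replica count, and free-space headroom are assumptions that depend heavily on your mappings and settings:

```python
import math

# Back-of-the-envelope capacity estimate for the figures discussed above.
daily_ingest_gb = 250      # raw data per day
retention_days = 365       # roughly 1 year of retention
disk_ratio = 1.0           # assumed indexed-size / raw-size ratio; tune after testing mappings
replicas = 1               # assumed one replica copy of every shard
disk_per_node_tb = 2.0     # the 2TB per data node proposed above
headroom = 0.85            # assumed usable fraction of disk before hitting watermarks

raw_tb = daily_ingest_gb * retention_days / 1024
indexed_tb = raw_tb * disk_ratio * (1 + replicas)
nodes_needed = math.ceil(indexed_tb / (disk_per_node_tb * headroom))

print(f"Raw data over retention: {raw_tb:.0f} TB")    # ~89 TB
print(f"Indexed plus replicas:   {indexed_tb:.0f} TB")
print(f"Data nodes at 2TB each:  {nodes_needed}")
```

Even if your mappings compress well below a 1.0 ratio, this points to either considerably more data nodes, much larger disks per node, or a shorter retention period for the full-fidelity indexes.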
Also, what about an ingest node? Sorry, I found a new document (there are so many ;-))... Where does the ingest node come into play? Is it the same as Logstash?
The ingest node is a new node type in Elasticsearch 5.x which allows you to transform indexing requests prior to writing them to Elasticsearch. It supports a subset of the functionality available in Logstash and can allow for a simpler architecture in some cases. If you decide to use them, you should probably use dedicated ingest nodes as they, like Logstash, can be CPU intensive.
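As a quick illustration of how an ingest node differs from Logstash, here is a minimal sketch against the Elasticsearch 5.x REST API; the endpoint, index name, pipeline name, and grok pattern are all illustrative assumptions:

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

# Register an ingest pipeline that parses an Apache-style access log line
# into structured fields, roughly what a simple Logstash grok filter does.
pipeline = {
    "description": "Parse Apache access logs",
    "processors": [
        {"grok": {"field": "message", "patterns": ["%{COMMONAPACHELOG}"]}},
        {"remove": {"field": "message"}},
    ],
}
requests.put(f"{ES}/_ingest/pipeline/apache-logs", json=pipeline).raise_for_status()

# Index a document through the pipeline; the node that handles this request
# (ideally a dedicated ingest node) runs the processors before storing the doc.
doc = {"message": '127.0.0.1 - - [01/Jan/2017:00:00:00 +0000] "GET / HTTP/1.1" 200 123'}
resp = requests.post(
    f"{ES}/logs-2017.01.01/access",
    params={"pipeline": "apache-logs"},
    json=doc,
)
print(resp.json())
```

Unlike Logstash, this processing happens inside the cluster itself, which is why putting it on dedicated ingest nodes keeps the CPU load off the data nodes.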