I would recommend this video on quantitative cluster sizing, as it describes how to go about answering your questions. There is no easy, analytical formula to plug numbers into, so you will need to benchmark in order to get an accurate answer.
The principles discussed in the video still apply to large use-cases. As a matter of fact, the larger the data volumes, the more important it is to go through a proper sizing exercise in order to identify the answers to your questions.
I almost exclusively see 1 replica used for logging use cases, as the total data volume on disk is one of the factors that drive the cluster size. The ratio between the raw data size and the size it takes up on disk once indexed and replicated depends on a number of factors, e.g. the type of data you are indexing, how much data you add during enrichment, and what mappings you use. This blog post, which is getting a bit old, shows an example of how mappings and index settings can affect this ratio.
Elasticsearch applies compression, and as described in the blog post I linked to, there are a couple of options.
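As a rough sketch of the kind of index-level configuration that influences on-disk size (assuming Elasticsearch 7.x or later, a cluster reachable on localhost:9200 without security enabled, and an illustrative index name and mapping):

```python
import requests

ES = "http://localhost:9200"  # assumption: local cluster, no authentication

# Hypothetical index illustrating settings that affect on-disk size:
# the best_compression codec for stored fields, a single replica, and
# explicit mappings instead of relying purely on dynamic defaults.
index_body = {
    "settings": {
        "index.codec": "best_compression",
        "number_of_shards": 1,
        "number_of_replicas": 1,
    },
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "host": {"type": "keyword"},
            "message": {"type": "text"},
        }
    },
}

resp = requests.put(f"{ES}/logs-sizing-test", json=index_body)
resp.raise_for_status()

# After indexing a known volume of raw data, compare it with the store size
# reported by the _cat/indices API to estimate the raw-to-indexed ratio.
print(requests.get(f"{ES}/_cat/indices/logs-sizing-test?v&h=index,pri,rep,store.size").text)
```

Indexing the same sample data into a few variations of this index (different codec, mappings, replica count) is a simple way to see how much the ratio moves for your particular data.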
As stated in the video, we always recommend that you perform a shard sizing benchmark to identify the ideal shard size. When I have benchmarked different types of data, I have seen the ideal shard size range from a few GB to a few tens of GB.
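While running such a benchmark, you can track how large each shard has grown and correlate that with query and indexing performance. A minimal sketch, again assuming a local cluster without security:

```python
import requests

ES = "http://localhost:9200"  # assumption: local cluster, no authentication

# List all shards with their on-disk size in GB, largest first, so shard size
# can be correlated with the performance measured during the benchmark.
resp = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "bytes": "gb", "h": "index,shard,prirep,store,node"},
)
resp.raise_for_status()

for shard in sorted(resp.json(), key=lambda s: float(s["store"] or 0), reverse=True):
    print(f"{shard['index']} shard {shard['shard']} ({shard['prirep']}): "
          f"{shard['store']} GB on {shard['node']}")
```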
No, there is not, as it depends on a large number of factors, as outlined above.
The amount of data and the number of shards a node can handle typically depend on the heap size. We recommend a maximum heap size of around 30GB, which means nodes with 64GB of RAM tend to be the sweet spot. Another thing that is very important for the performance of Elasticsearch is the speed and throughput of the storage. We generally recommend locally attached storage, but also have a lot of users that run on networked storage. The performance of networked storage can however vary a lot, especially since Elasticsearch performs a lot of reads and writes across the data files.
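If you want to verify how much heap your data nodes actually have configured and how heavily it is used, the node stats API reports this per node. A small sketch, assuming the same local, unsecured cluster as above:

```python
import requests

ES = "http://localhost:9200"  # assumption: local cluster, no authentication

# Report the configured heap and current usage for each node, which helps
# confirm that data nodes stay within the ~30GB heap guideline.
resp = requests.get(f"{ES}/_nodes/stats/jvm")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
    heap_used_pct = node["jvm"]["mem"]["heap_used_percent"]
    print(f"{node['name']}: heap max {heap_max_gb:.1f} GB, {heap_used_pct}% used")
```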
Thanks for the details. Yes, the video gives some basic ideas for planning. I was expecting that the video would describe the architecture and hardware specs used for one large platform deployment.
I think I have sufficient pointers now after reading a few blogs, the video, your comments, and the ELK documentation. I shall come up with a better platform for this large data set requirement.
There are a few different architectures that we use for large-scale deployments. It is possible to set up a cluster with all data nodes having the same specification, but for many use cases with high ingest rates we see the hot/warm architecture being deployed with success. Depending on constraints around hardware and retention periods, one may be more suitable than the other. I work as a Solution Architect at Elastic, and we often help users with large use cases work through these types of questions in order to get as accurate an estimate as possible and provide guidance on architecture and machine specifications.
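To give an idea of the mechanism behind hot/warm, it is built on shard allocation filtering against custom node attributes. The attribute name and index name below are illustrative, and it assumes each node has been tagged (e.g. `node.attr.box_type: hot` or `warm` in elasticsearch.yml) on a local, unsecured cluster:

```python
import requests

ES = "http://localhost:9200"  # assumption: local cluster, no authentication

# Pin a newly created, actively written index to hot nodes.
requests.put(
    f"{ES}/logs-2024.01.01",
    json={"settings": {"index.routing.allocation.require.box_type": "hot"}},
).raise_for_status()

# Later, once the index is no longer being written to, relocate it to warm nodes
# by updating the same dynamic setting; Elasticsearch moves the shards automatically.
requests.put(
    f"{ES}/logs-2024.01.01/_settings",
    json={"index.routing.allocation.require.box_type": "warm"},
).raise_for_status()
```

Hot nodes are then typically provisioned with fast local storage and more CPU for indexing, while warm nodes hold larger volumes of older, less frequently queried data on denser storage.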