The webinar above showcases a number of formulas for Elasticsearch cluster sizing. There are discussions where responses show unfamiliarity with the formulas or techniques given in the webinar, and the discussion goes on to say that these formulas have many assumptions behind them. So my question is:
What are the assumptions, not specified in the webinar or presentation, behind the following formulas?
Volume sizing:
I know sizing depends on many factors and use cases, and that there is no fixed technique. After looking at some discussions I'm starting to wonder about the feasibility of these formulas. So I'm requesting some clarity on these formulas and on the use cases they would be helpful in.
My scenario is that I want to size a cluster by deciding the number of nodes in it and the resources (disk, CPU, RAM, etc.) allocated to those nodes. So I'm using volume sizing to find the total disk space and number of nodes, and throughput sizing to find the thread pool size. I'm confused about which output (total data nodes) to use for the number of nodes: is it the one from volume sizing or from throughput sizing? Both formulas have total data nodes as their final output.
Actually, I'm trying to build a utility that takes several factors like document size, document count per day, retention period, number of replicas, read rate, write rate, etc., and outputs how many nodes you need and the amount of storage, RAM, and CPU for those nodes. That's when I came across these formulas and thought they were appropriate for what I'm doing, until I ran into the discussion I mentioned above, which is creating some doubts for me about the formulas. As for the use case, ideally this utility should work for any use case. Is this even possible? A rough sketch of the kind of calculation I have in mind is below.
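To make the question concrete, this is the structure I took away from the webinar, written as a small Python sketch. The constants and inputs (15% watermark headroom, 10% margin of error, the memory-to-data ratio, and the example numbers) are my own assumptions, not values the webinar guarantees for every use case, and the write-pool assumption is just the Elasticsearch default of one thread per core.

```python
import math

def volume_sizing(raw_gb_per_day, retention_days, replicas,
                  ram_per_node_gb, memory_to_data_ratio):
    """Estimate the data nodes needed to hold the data on disk."""
    total_data_gb = raw_gb_per_day * retention_days * (replicas + 1)
    # Add headroom for the disk watermark (~15%) and a margin of error (~10%) -- my assumptions.
    total_storage_gb = total_data_gb * (1 + 0.15 + 0.10)
    disk_per_node_gb = ram_per_node_gb * memory_to_data_ratio
    return math.ceil(total_storage_gb / disk_per_node_gb), total_storage_gb

def throughput_sizing(peak_events_per_sec, avg_event_time_ms, cores_per_node):
    """Estimate the data nodes needed to keep up with peak indexing throughput."""
    peak_threads = math.ceil(peak_events_per_sec * avg_event_time_ms / 1000)
    # Assuming the write thread pool equals the number of CPU cores per node.
    write_pool_size = cores_per_node
    return math.ceil(peak_threads / write_pool_size)

nodes_for_volume, storage_gb = volume_sizing(
    raw_gb_per_day=100, retention_days=30, replicas=1,
    ram_per_node_gb=64, memory_to_data_ratio=30)
nodes_for_throughput = throughput_sizing(
    peak_events_per_sec=20000, avg_event_time_ms=5, cores_per_node=16)

print(f"Volume sizing:     {nodes_for_volume} data nodes, {storage_gb:.0f} GB total storage")
print(f"Throughput sizing: {nodes_for_throughput} data nodes")
# This is where I'm stuck: which of the two node counts should the utility report?
```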
This webinar and the given formulas assume a standard logging and metrics use case where immutable data is ingested and queried at a reasonable level. They will give you a rough estimate of the cluster and node size needed. There are a lot of factors that affect sizing, e.g. query rates, acceptable query latencies, storage performance, peak-to-average ingest ratios, data and query complexity, etc., so I always recommend testing with real data or running a realistic benchmark to get a more accurate estimate.
Search use cases are often sized very differently, and these formulas would not apply there.