I found out that one of the methods for cluster planning is Throughput Sizing, which uses the number of searches per second, search response time, physical cores, etc. to calculate how many nodes are needed.
Below are the formulas I got:
Peak Threads = ROUNDUP(Peak searches per second * Average search response time in milliseconds / 1000 Milliseconds)
Thread Pool Size = ROUNDUP((Physical cores per node * Threads per core * 3 / 2) + 1)
Total Data Nodes = ROUNDUP(Peak threads / Thread pool size)
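For reference, this is how I read those formulas as a quick calculation. This is just a rough sketch in Python; the input numbers below are placeholders I made up, not from any real workload:

```python
import math

# Hypothetical inputs -- replace with your own measured/known values.
peak_searches_per_second = 100      # expected peak query rate
avg_response_time_ms = 50           # average search response time in milliseconds
physical_cores_per_node = 8         # physical cores on each data node
threads_per_core = 2                # e.g. 2 with hyper-threading enabled

# Peak Threads = ROUNDUP(peak searches/sec * avg response time in ms / 1000 ms)
peak_threads = math.ceil(peak_searches_per_second * avg_response_time_ms / 1000)

# Thread Pool Size = ROUNDUP((physical cores per node * threads per core * 3 / 2) + 1)
thread_pool_size = math.ceil(physical_cores_per_node * threads_per_core * 3 / 2 + 1)

# Total Data Nodes = ROUNDUP(peak threads / thread pool size)
total_data_nodes = math.ceil(peak_threads / thread_pool_size)

print(f"Peak threads:     {peak_threads}")       # 100 * 50 / 1000 = 5
print(f"Thread pool size: {thread_pool_size}")   # ceil(8 * 2 * 1.5 + 1) = 25
print(f"Total data nodes: {total_data_nodes}")   # ceil(5 / 25) = 1
```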
Now, my questions:
How do I get the average search response time in milliseconds?
Is it by running a simple query, like the default match-all search, or do we need to run our own custom query to decide this? And where do I get this statistic: through an API, or do I need an outside application?
For the average search response time in milliseconds, is higher better or lower better?
I ask because I noticed that when I use Excel to populate this formula, the higher this parameter's value, the more nodes I need. But shouldn't it be the other way around?
I do not think you can use a formula to calculate this as it will depend on a number of factors, e.g. data, mappings, types of queries, hardware and cache hit ratio.
How large is your data set? Will it comfortably fit in the cache on a single node? Are you going to index or update data concurrently? What type of hardware are you planning to deploy on?
I do not think you can use a formula to calculate this as it will depend on a number of factors
That's weird, as the formula is the one provided in one of the capacity planning webinars. Frankly, I definitely couldn't come up with those formulas myself.
Let's just say that I have multiple data sets, each ranging from a few GB up to 40 GB, with no updates, just a weekly reindex, and I plan to deploy on VMs.
The reason I asked this kind of question is that I was hoping there are some guidelines on cluster planning, i.e. how many data nodes are needed, etc.
I am not familiar with this webinar or the formula. When I worked for Elastic and talked about performance and capacity planning, the advice was always to benchmark due to the large number of factors affecting performance.
It looks to me like the formula above assumes that the entire data set can be cached in memory, so that the search is CPU limited and does not depend on disk I/O, as this makes it much more likely that the query latency will vary based on the number of queries per second. If that is not the case in your use case, it may not be applicable, but it would be good to get some feedback from the author of the webinar about what assumptions go into this formula.
Given those caveats and assumptions, let me try to answer your questions.
If your full data set fits in the OS page cache, you need to run your own queries on your own data and measure latency. I would create a script to do this or use esrally.
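As a minimal sketch of what such a script could look like: this assumes a node reachable at http://localhost:9200 and an index called "my-index" holding a representative copy of your data (both names are placeholders), and measures latency over repeated runs of a query you supply yourself:

```python
import time
import statistics
import requests

# Assumptions (placeholders, not from this thread): a node at
# http://localhost:9200 and an index "my-index" with representative data.
ES_URL = "http://localhost:9200/my-index/_search"

# Use a query that reflects your real workload, not just a match_all.
QUERY = {"query": {"match": {"title": "example"}}}

client_ms = []   # wall-clock latency seen by the client
took_ms = []     # time Elasticsearch reports it spent executing the search

for _ in range(100):
    start = time.perf_counter()
    resp = requests.post(ES_URL, json=QUERY)
    resp.raise_for_status()
    client_ms.append((time.perf_counter() - start) * 1000)
    took_ms.append(resp.json()["took"])

print(f"avg client latency: {statistics.mean(client_ms):.1f} ms")
print(f"avg 'took':         {statistics.mean(took_ms):.1f} ms")
print(f"p95 client latency: {statistics.quantiles(client_ms, n=20)[18]:.1f} ms")
```

Tracking both the client-side latency and the "took" field gives you a feel for how much time is spent in Elasticsearch itself versus network and client overhead.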
The more expensive your queries are, the more work Elasticsearch needs to do, which leads to longer latencies. This means you need more nodes the longer your queries take to run.
If you are in a situation where your data set does not completely fit into the OS page cache I would recommend benchmarking. Set up a node or small cluster containing your full data set. For some use cases it makes sense to put all data on all nodes while for others you tend to spread the data out and have fewer replicas but a larger portion of the data cached.
Then benchmark by sending queries at the cluster. If you are going to index or update concurrently, make sure you include this load in the benchmark. Start with a low query concurrency level (but full indexing load) and gradually increase it until query latencies are no longer deemed acceptable. This indicates you have reached the limit of your cluster. If you require a query throughput twice what you achieved, you basically need to double the size of your cluster. Then add a node or two for redundancy, so you can handle node downtime and still have enough resources to meet your SLA.
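To make the scaling arithmetic concrete, here is a small sketch; all numbers are invented for illustration:

```python
import math

# Hypothetical benchmark results -- substitute your own measurements.
benchmark_nodes = 3          # size of the test cluster
max_acceptable_qps = 150     # highest query rate before latency became unacceptable
required_qps = 300           # throughput you actually need at peak
redundancy_nodes = 2         # headroom for node downtime / maintenance

# Scale linearly from the benchmarked cluster, then add redundancy.
scaled_nodes = math.ceil(required_qps / max_acceptable_qps * benchmark_nodes)
total_nodes = scaled_nodes + redundancy_nodes

print(f"Scaled data nodes: {scaled_nodes}")   # 300 / 150 * 3 = 6
print(f"With redundancy:   {total_nodes}")    # 6 + 2 = 8
```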