My name is Alex and I'm a fullstack developer. I have to host an Elasticsearch cluster whose main role will be to serve data to Java client applications. I've noticed that a lot of things are well documented for ES, but I didn't find many explanations of how to define the VM hardware and cluster configuration for a given workload.
However, I have to say that this article is well written and gave me a clearer picture: https://www.elastic.co/blog/found-sizing-elasticsearch
Let me explain what we are expecting:
- We have an index of fewer than 1 million documents.
- We will have 3 VMs available for the project (1 acting as a reverse proxy that authenticates queries through HAProxy, and 2 hosting one ES node each).
In theory, we have a bit more than 9000 client apps in the wild, and each of them will make at most 100 requests per day, spread over a 12-hour window. So:
9000 * 100 = 900 000 requests/day
900 000 / 12 = 75 000 requests/hour -> 75 000 / 3600 ≈ 21 requests/second (max, averaged over the 12-hour window)
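The same back-of-the-envelope estimate as a small Python sketch, using only the figures above. It assumes the requests are spread evenly over the 12-hour window, so the result is an average rate and real bursts could be higher:

```python
# Rough load estimate based on the figures given in this post.
clients = 9000             # client apps in the wild
requests_per_client = 100  # maximum requests per client per day
window_hours = 12          # daily window during which requests arrive

requests_per_day = clients * requests_per_client
requests_per_hour = requests_per_day / window_hours
# Assumption: requests are evenly distributed over the window.
requests_per_second = requests_per_hour / 3600

print(f"{requests_per_day} requests/day")
print(f"{requests_per_hour:.0f} requests/hour")
print(f"{requests_per_second:.1f} requests/second on average")
```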
The queries we will run are basic search operations (no aggregations, no intelligence, ...): only full-text search and wildcard queries on one or more fields.
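To give an idea of the kind of requests I mean, here is a rough sketch of two query bodies, written as Python dicts and printed as JSON. The field names (`title`, `description`) and the search terms are made up for illustration; our real mapping is different:

```python
import json

# Full-text search across several fields.
# Field names are hypothetical placeholders, not our real mapping.
full_text_query = {
    "query": {
        "multi_match": {
            "query": "some search terms",
            "fields": ["title", "description"],
        }
    }
}

# Wildcard search on a single (hypothetical) field.
wildcard_query = {
    "query": {
        "wildcard": {
            "title": {"value": "elast*"}
        }
    }
}

# These bodies would be sent to the index's _search endpoint.
print(json.dumps(full_text_query, indent=2))
print(json.dumps(wildcard_query, indent=2))
```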
We plan to size each ES VM with 4 GB of RAM, a standard 2-core CPU and 20 GB of storage, running CentOS. These VMs will be dedicated to running ES.
Is this config coherent with our approximate maximum load? How many primary and replica shards should we create for the index? Is a 2-node cluster safe from the split-brain issue if our HAProxy handles load balancing and failover at a higher level?
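For context on the split-brain question: as far as I understand, the usual safeguard is to require a majority of master-eligible nodes, i.e. floor(N / 2) + 1 (in older ES versions this is the `discovery.zen.minimum_master_nodes` setting). A minimal sketch of that arithmetic, which reflects my understanding rather than a tested configuration:

```python
def majority(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """Nodes that can be lost while a majority is still reachable."""
    return master_eligible_nodes - majority(master_eligible_nodes)


for nodes in (2, 3):
    print(
        f"{nodes} nodes: majority = {majority(nodes)}, "
        f"tolerated failures = {tolerated_failures(nodes)}"
    )
```

If this is right, a 2-node cluster needs both nodes to form a majority, so it cannot lose a node and still elect a master, which is why I'm unsure whether HAProxy failover alone is enough.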
Thanks a lot for reading this and for your pro tips!
PS: I love Elasticsearch, but I'm quite a noob when it comes to building a complete cluster optimized for our needs.