I have a new deployment to plan and need to estimate the hardware requirements. On my test deployment the index size is about 180 GB. I was planning to deploy on machines with 64 GB RAM, allocating 31 GB to the Java heap on each machine. I understand this is not an exact science and that there are many factors to consider, but as a general rule of thumb, would you recommend sizing the initial deployment with enough RAM to hold the entire index, in other words starting with six shards (6 * 30 GB = 180 GB), each on a node with a 31 GB heap? Second, if I want one replica, would I still plan for a total of 6 shards, 3 primaries and 3 replicas? Or should I plan for 6 primaries and 6 replicas, each on a node with a 31 GB heap? I do understand that I will adjust this later, but I need to plan for costs and am looking for a general rule. Thanks!
This depends a lot on the use case, which you have not told us anything about. Is the use case search heavy? What type of data and query requirements? How much indexing and/or updating?
The use case is search heavy and the individual documents to be indexed are English text, about 5 KB in size. Query requirements are expected to be 10-50 qps. After the initial offline load, the index will grow daily (but slowly) and will be used primarily for Google-like searches, matching multiple fields and retrieving snippets from 50 documents for each query. The system is mission critical, but the expectation is that the system configuration may be tweaked after the initial deployment.
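To make that concrete, here is roughly what one of those queries would look like (a sketch using the Python client; the index name and field names are placeholders, not our real schema):

```python
# Rough sketch of the "Google-like" query pattern described above.
# Assumes the elasticsearch-py 8.x client; "docs", "title" and "body"
# are placeholder names for illustration only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="docs",
    query={
        "multi_match": {
            "query": "user entered search terms",
            "fields": ["title^2", "body"],  # match multiple fields, boost the title
        }
    },
    highlight={"fields": {"body": {}}},  # return snippets per hit
    size=50,  # 50 documents with snippets per query
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit.get("highlight", {}).get("body", []))
```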
My thinking is to provide sufficient RAM for the entire index to be mapped. So for a 180 GB index I would start with 6 shards, one shard per node, with each node given a 30 GB heap. I would duplicate that for each replica, so 12 nodes for 1 replica, 18 for two replicas, and so on. That would be my INITIAL SIZING. I'm asking the community here to comment on that rule of thumb, or to suggest others that are more reliable.
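In index-settings terms, my understanding is that the plan above comes down to something like this (a sketch with the Python client; the index name is a placeholder), where one replica of a 6-shard index means 12 shard copies in total, one per node:

```python
# Sketch of the proposed layout: 6 primaries plus 1 replica of each,
# i.e. 12 shard copies in total (one per node in the plan above).
# Assumes the elasticsearch-py 8.x client; "docs" is a placeholder name.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="docs",
    settings={
        "number_of_shards": 6,    # primaries
        "number_of_replicas": 1,  # one full copy of each primary
    },
)
```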
Elasticsearch relies on heap space for some data structures and working memory, but the full shard is not kept on the heap. You may therefore not need such a large heap, and it is generally recommended to run with as small a heap as possible, although it has to be large enough that you do not suffer from heap pressure. Elasticsearch also uses off-heap memory for some other data structures, but relies on the operating system page cache for quick access to the data in your shards. For a search-heavy use case like yours, performance will improve if you can make sure the page cache can hold all your shards.
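If you want to keep an eye on heap pressure once the cluster is running, something along these lines (a sketch with the Python client; the cluster URL is a placeholder) will show heap usage per node from the nodes stats API:

```python
# Minimal sketch: check JVM heap usage per node via the nodes stats API.
# Assumes the elasticsearch-py 8.x client and a placeholder cluster URL.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    print(f"{node.get('name', node_id)}: heap used {heap_pct}%")
```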
You should also look at the official sizing recommendations.