I've seen some confusing (seemingly contradictory) official guidelines about sizing an Elasticsearch cluster, so I'd like to know if somebody can shed some light on this.
These are the guidelines in question:
As per the ES sizing and capacity planning webinar (minute 0:49:16), we should keep the memory-to-disk ratio of our hot data nodes at 1:30. So one hot node with 64GiB of RAM should have a disk no larger than 1920GiB.
... it is common to see shards between 20GB and 40GB in size; and:
... a good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards
And here comes what I think is a contradiction. Suppose a machine with 64GiB of RAM (30GiB of Java heap). According to the first recommendation, the disk should be at most 1920GiB. If I make each shard 30GiB, as per the second recommendation, I could only fit 64 of them on a single node, but the third recommendation says ES supports 600 shards on that same node with 64GiB of RAM. So there is a mismatch of roughly 9x in how many shards I can put on that node.
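To spell the mismatch out, here is the back-of-the-envelope arithmetic as a small Python sketch (the 30GiB shard size and 30GiB heap are simply the figures taken from the guidelines above):

```python
ram_gib = 64        # physical RAM on the node
heap_gib = 30       # Java heap configured on that node
shard_gib = 30      # "ideal" shard size taken from the second guideline

disk_by_ratio = ram_gib * 30                   # 1:30 memory-to-disk guideline -> 1920 GiB
shards_that_fit = disk_by_ratio // shard_gib   # only 64 shards of that size fit on the disk
shards_by_heap_rule = heap_gib * 20            # 20 shards per GB of heap -> 600 shards

print(disk_by_ratio, shards_that_fit, shards_by_heap_rule)  # 1920 64 600
```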
I understand these are guidelines, not exact math, but they suggest such different things that it's not possible to follow them all.
So, my questions are:
Under what circumstances can I put 600 shards on a node if the ideal shard size is around 30GiB? That would require an 18TiB disk, which goes way above the 1:30 ratio.
In the 1:30 ratio, is the ratio really "Physical RAM" vs. "Physical Disk"? Or should it be "Physical RAM" vs. "Disk used by shards"? Because, if you notice, none of those disk figures account for leaving 20% of the disk empty (as recommended in other guidelines).
Shouldn't the 1:30 ratio consider the size of each shard? Putting 30 shards of 20GiB on the disk is not the same as putting 15 shards of 40GiB, as the total shard count affects how much heap is used.
As you say, these are guidelines, designed to help people get a rough idea of cluster sizing and avoid common mistakes. They're likely based on experience with different workloads and different Elasticsearch versions, and I'm sure you can refine them for your specific case with careful benchmarking. There is no straight answer to the question of cluster sizing.
You typically want fast and expensive storage for your hot nodes, so keeping them smaller makes more economic sense, but warm nodes can be larger. Our managed service targets a 1:160 ratio for warm nodes. If you have a 30GB heap then the 20-per-GB ratio limits you to 600 shards per node, which at 30GB per shard is 18TB. If you have 128GB of RAM and target a 1:160 ratio then that's 20TB. That's close enough for this kind of rough guideline IMO.
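For anyone following along, here is a rough sketch of that comparison (the 128GB-RAM warm node is just the example figure above; the rest follows from the stated ratios):

```python
# Ceiling implied by the shards-per-heap guideline, assuming large 30GB shards
heap_gb = 30
max_shards = heap_gb * 20            # 600 shards
disk_from_shards = max_shards * 30   # 18000 GB, roughly 18TB

# Warm-node target using the 1:160 RAM-to-disk ratio
warm_ram_gb = 128
disk_from_ratio = warm_ram_gb * 160  # 20480 GB, roughly 20TB
```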
I think that's within the error bounds.
Yes of course, there are many other factors that this kind of guideline doesn't take into account.
If you have a 30GB heap then the 20-per-GB ratio limits you to 600 shards per node, which at 30GB per shard is 18TB.
That calculation makes sense. But then, how does the 1:30 ratio apply in this same scenario?
It would say that for that same 30GB heap (which means 64GB of RAM) we can only have a disk of 1.9TB (64 * 30 = 1920GB). So each guideline gives a drastically different disk estimate: one suggests a limit of 18TB and the other 1.9TB, approximately a 10x difference.
One of the most common problems when Elasticsearch is used with time-based indices is that users end up with far too many small shards. This results in a large cluster state and generally poor performance. Searching a single small shard is faster than searching a single large one (latency tends to be proportional to size), but searching lots of small shards is not necessarily faster than searching a few large ones. This is where the recommendation about shards per node and shard size comes from. A 40GB-50GB shard is generally quite performant for time series data, although this can depend on document size and mappings.
If you read the blog post again I think you will notice that it recommends a maximum of 20 shards per GB heap. It does not state that a node should be able to handle 600 shards of any particular size. Staying below 20 shards per GB heap will generally keep you from suffering the problems associated with too many shards, but if you are using large shards I would expect you to have far fewer shards than this.
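One way to read the two figures together (my interpretation, not wording from the blog post) is that both are ceilings, and the effective limit is whichever one you hit first:

```python
def max_shards_per_node(heap_gb: float, disk_gb: float, shard_gb: float) -> int:
    """Treat both guidelines as ceilings; the smaller one is the binding limit."""
    heap_limit = int(heap_gb * 20)         # at most 20 shards per GB of heap
    disk_limit = int(disk_gb // shard_gb)  # how many shards of this size fit on disk
    return min(heap_limit, disk_limit)

# 30GB heap, 1920GB disk (1:30 ratio on 64GB RAM), 30GB shards -> disk is the limit
print(max_shards_per_node(30, 1920, 30))  # 64
# The same node with many small 1GB shards -> the heap rule becomes the limit
print(max_shards_per_node(30, 1920, 1))   # 600
```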
When it comes to sizing different types of nodes I tend to think of it in terms of balancing resource usage, e.g. heap memory, CPU and disk I/O. Different types of nodes will have different resource profiles. Hot nodes may have the same heap size as a warm or cold node, but generally have much higher I/O capacity and more CPU cores. The amount of resources required also depends on how the node is used. This webinar discusses this and talks about how the indexed data on disk affects heap usage.
Hot nodes that serve requests will be parsing and indexing a lot of data, which at high ingestion rates requires a good amount of heap and CPU. The process of writing indexed data to disk and merging segments is also quite I/O intensive. As indexing data is an expensive process you often want to generate time-based indices with the shard size you want to use in later phases, although you can use the shrink index API to adjust this. All this processing means there is less heap memory that can be used for indexed data on disk, which is why this type of node tends to store relatively little data. I have seen many cases where RAM-to-disk ratios have been less than 1:30 for an optimal cluster. A common assumption is also that the most recent data is the most heavily queried and that data gets queried less the older it gets. This means that hot nodes also serve queries for the data they hold, which further uses resources in terms of heap, CPU and disk I/O.
If you compare this to warm nodes, which do not perform any indexing and where almost all resources can be dedicated to storing data and serving queries, these are able to allocate much more heap to holding data on disk. Even though they may have slower storage with worse I/O performance, the fact that they primarily only serve queries means this goes further.
For cold nodes that store frozen indices, heap usage is even lower. These nodes can store data at very high RAM-to-disk ratios, as indexed data in the frozen state uses very little heap. A very small node can serve very large amounts of data, and the shards-per-GB-of-heap guideline does not apply here. The amount of data stored on this type of node is often limited by the required query latency rather than by system resources.
In the end these are, as David says, just high-level guidelines to avoid the most common mistakes. All use cases are different, so there is no substitute for benchmarking and testing with realistic data and workloads if you want to be sure.