I am setting up a cluster with the Elasticsearch BASIC version, and one question I have relates to the memory-to-storage ratio.
I saw in a webinar (https://www.elastic.co/pdf/elasticsearch-sizing-and-capacity-planning.pdf) that the default ELK ratio on HOT nodes is 1:30. That is, with 1GB of memory, I can index 30GB of data on disk. Is this ratio fixed, i.e. strictly enforced? Once the attribute (node.attr.data = hot) is defined, will Elasticsearch apply this ratio? Or is it just an example of what a hot node should look like?
According to Elastic's recommendation, the maximum memory the Elasticsearch JVM should have is 30GB. So if I have a 64GB node and give 30GB to the JVM, is the maximum it will be able to index on a hot node 30 * 30 = 900GB of disk storage?
Is the node.attr.data attribute used only for storage-policy purposes (placing hot data on higher-performing nodes)?
Assuming I have a node with 64GB of memory (30GB for the Elasticsearch JVM) and use Optane + NVMe (high-performance) disks for data storage: if I provision a 5TB disk for that node, will Elasticsearch be able to use all the available storage capacity (a ratio of 1:166)?
This is not fixed at all; it is just high-level guidance. Hot nodes do a lot of work, so a large portion of the heap is needed for processing requests and indexing. That usually leaves less heap space to handle the indexed data, which is why hot nodes typically hold less data than other node types. Indexing is also very I/O intensive, so indexing and querying compete for disk bandwidth: the more data these nodes hold, the more I/O querying will typically require as well. Depending on your use case, the type of data and queries, and the combined indexing and query load, you may find that your optimum is more or less than the general guideline.
As described above, this will depend on the use case. I have seen use cases where the optimal ratio for hot nodes has been significantly lower than 1:30.
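As a rough sanity check, the arithmetic from the question can be sketched as follows (the 1:30 figure and the node sizes are the assumptions from this thread, not enforced limits):

```python
# Back-of-the-envelope capacity estimate from a heap-to-storage ratio.
# The 1:30 figure is only a hot-node guideline; the real optimum
# varies with data, mappings, shard sizes, and query load.

def disk_capacity_gb(heap_gb: float, ratio: float) -> float:
    """Disk capacity implied by a given heap size and memory:storage ratio."""
    return heap_gb * ratio

# 30 GB heap at the 1:30 hot-node guideline -> 900 GB of data on disk
print(disk_capacity_gb(30, 30))  # 900.0

# Conversely, filling a 5 TB (5000 GB) disk with a 30 GB heap
# implies a ratio of roughly 1:166
print(int(5000 / 30))  # 166
```

Nothing in Elasticsearch enforces either number; they only describe how much heap headroom you are likely to need.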
This is used to define which tier each node belongs to.
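For example, the tier attribute referenced in the question is set per node in elasticsearch.yml (a minimal sketch; the attribute name and value follow the convention used in this thread):

```yaml
# elasticsearch.yml on a hot node (custom node attribute, as in the question)
node.attr.data: hot
```

Indices can then be pinned to those nodes with the shard-allocation-filtering index setting `index.routing.allocation.require.data: hot` (and later moved by changing it to e.g. `warm`). The attribute only controls shard placement; it does not impose any memory:storage ratio.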
For warm and cold nodes that do not perform any indexing, there are typically two factors that determine how much data a node can hold: heap usage and query latency requirements.
Indexed data on disk requires a certain amount of heap space, and the amount depends on the version of Elasticsearch as well as your data, index settings, mappings, and shard sizes. This has improved in the most recent versions of Elasticsearch, so although it used to be the main limiting factor, that may no longer be the case.
The more data a node holds, the more disk I/O will be required to serve queries, so query latency requirements may also limit how much data a node can be allowed to hold. There is no fixed limit, and nodes may hold a lot of data under the right circumstances. You will need to test to find out what is right for you.