Hot/Warm Architecture with uniform hardware?

Is there any benefit to (or any problem with) data tiering if all nodes use the same type of storage media? In the past, I've seen hot/warm/cold architectures suggested for clusters with different storage media - for example, SSDs for the hot tier, local HDDs for warm, and network-attached HDDs for cold.

Because we only have one type of storage available, our cluster has historically used only hot-tier nodes. We're exploring tiering as a way to adjust the RAM:disk ratios of our nodes - this would let us increase cluster storage without increasing the total number of data nodes to manage.

For background, our cluster looks like the following:

  • VMs with network-attached storage (enterprise-grade HDDs)
  • 3 master nodes (16GB RAM each), 2 ingest nodes, and 15 data nodes (2TB storage, 32GB RAM, 8GB JVM heap each)
  • Cluster disk available: 5.5TB free of 28.2TB total
  • Typical monthly ingest is approx. 2-4TB, including shard replicas

Our current plan for tiering would adjust the data node specs as follows (with a rough ILM sketch after the specs):

Hot Tier:

  • 2TB storage
  • 32GB RAM
  • 16GB JVM heap

Warm Tier:

  • 5TB storage
  • 32GB RAM
  • 8GB JVM heap
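
If we went this route, my understanding is that the nodes would get data_hot / data_warm roles in elasticsearch.yml, and an ILM policy would handle the hot-to-warm migration. A rough sketch of the kind of policy we have in mind - the policy name, rollover thresholds, and 14d cutoff are placeholders, not settled values:

```
# Hypothetical ILM policy: roll over on the hot tier, move to warm after 14 days
PUT _ilm/policy/logs-tiered
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "14d",
        "actions": {
          "set_priority": { "priority": 50 }
        }
      }
    }
  }
}
```

As I understand it, the warm phase's implicit migrate action is what would actually move the shards onto the data_warm nodes.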

Does anyone see any issues with this? Is tiering only recommended if you have different storage media available?

What is driving this suggested change in cluster topology? What is the problem you are looking to solve?

Which version of Elasticsearch are you using?

What retention period(s) do you have for the data?

How is the performance of the cluster as it is currently configured?

How many nodes of the different types would the proposed hot-warm topology have?

Would this change coincide with any change of ingest volume or retention period?

6 of the 15 data nodes are above the low disk watermark. We've decided to increase overall cluster storage to better absorb occasional logging spikes, reduce time spent rebalancing, and reduce the search latency associated with high disk utilization.
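
For reference, we're gauging the disk pressure with the standard cat/cluster APIs:

```
# Per-node disk usage, worst first
GET _cat/allocation?v=true&s=disk.percent:desc

# Current watermark settings (defaults included)
GET _cluster/settings?include_defaults=true&flat_settings=true
```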

Elasticsearch 8.17.3, with plans to upgrade to 9.x as part of our cluster redesign.

Everyday performance is generally fine. Search latency becomes noticeable once 7 or more nodes are above the low disk watermark, and under that pressure nodes sometimes leave the cluster, triggering rebalancing.

Any maintenance or rebalancing takes a significant amount of time (multiple days) - I suspect due to the low disk availability. Search and indexing latency is high during this time, impacting our ability to use the cluster.
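
For what it's worth, this is how we've been watching the recoveries crawl along (standard cat APIs, nothing custom):

```
# Active shard recoveries with progress and elapsed time
GET _cat/recovery?v=true&active_only=true&h=index,shard,stage,bytes_percent,time

# Relocating/initializing shard counts at a glance
GET _cat/health?v=true
```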

We haven't fully mapped it out, as we aren't sure if hot/warm is even suitable with uniform storage hardware. We would likely aim for 3-4 hot nodes and 4-5 warm nodes based on desired cluster storage.

No plans to change either at this time.

This is not ideal, as Elasticsearch is quite an I/O-intensive data store. A lot of the issues around shard relocation and rebalancing may very well be caused by poor storage performance.

If all nodes have the same slow storage, I do not see any benefit in adopting a hot-warm architecture, as the smaller number of hot nodes would take the full indexing load and likely become even more overloaded than the current 15 nodes are.

Given the current limitations on hardware and storage, I would probably recommend keeping all data nodes at the same specification but increasing storage to limit the need for rebalancing. As indexing is the most I/O-intensive process, I would also recommend adjusting the primary shard count of the indices you are actively indexing into and making sure these shards are as evenly distributed across the cluster as possible (a sketch of what I mean is below). If you are indexing into multiple indices, you may want to keep the number of shards actively being indexed into low on all nodes.
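
As a hedged sketch of that shard distribution idea - the template name, index pattern, and numbers are illustrative, assuming your 15 data nodes and 1 replica:

```
# Illustrative template: 15 primaries + 15 replicas = 2 shards per data node,
# with a per-node cap so the write load spreads evenly
PUT _index_template/logs-active
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 15,
      "index.number_of_replicas": 1,
      "index.routing.allocation.total_shards_per_node": 2
    }
  }
}
```

One caveat: total_shards_per_node is a hard limit, so if a node drops out, shards that would exceed the cap on the remaining nodes can stay unassigned until it returns. Worth weighing before applying it cluster-wide.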

The best way to resolve performance and stability issues would likely be to change to more performant storage.