New Elastic Cluster Sizing/Data-Tiering

Hello everyone,

I work at a medium-sized company and have been responsible for optimizing and expanding our internal, on-premises Elastic Stack cluster since earlier this year. The cluster has been in operation for years, but has not been optimally configured or managed so far. Together with a colleague, I have taken on the task of reorganizing the cluster to enable more efficient use of resources and ensure long-term performance.

Our goal is to prepare the cluster for a higher data load, introduce a data tiering storage concept, and assign the various node roles in the cluster more deliberately. Here is a summary of the current state, our plan and the necessary adjustments.


Current cluster state

Our Elastic Stack cluster currently consists of five VMs, each with 12 CPU cores, 32 GB RAM and 3.48 TB of SAN storage. The nodes are not specialized; each one carries all node roles (ML, hot data, ingest, master, coordinating, etc.), which leads to inefficient use of resources. In total, we have about 17 TB of storage capacity available across the cluster, of which roughly 13 TB is permanently occupied. In the future, many departments want to use Elastic with additional applications or clusters – it is becoming a trend in-house :slight_smile:

Our cluster is used by about 100 to 150 developers across 53 spaces in Kibana, with a daily ingest of around 250 GB. Most departments use Filebeat or Winlogbeat to ship log data to our cluster. We also monitor around 80 servers using Metricbeat. Our current Platinum license covers each of the five server instances. We currently run around 40 ML jobs. Because many departments bring many use cases, data retention is tightly restricted and the cluster often runs at its limit.
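As a reference point for the reorganization, roughly the following lists the current role layout and disk usage per node via the _cat/nodes API. This is only a minimal sketch; the endpoint URL and credentials are placeholders for our environment.

```python
# Minimal sketch: list each node's roles and disk usage via the _cat/nodes API.
# The URL and credentials below are placeholders, not our real values.
import requests

ES_URL = "https://elastic.example.internal:9200"  # hypothetical endpoint
AUTH = ("elastic", "changeme")                    # hypothetical credentials

# One line per node: name, assigned roles, disk usage and heap pressure.
resp = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"v": "true", "h": "name,node.roles,disk.used,disk.avail,heap.percent"},
    auth=AUTH,
    verify=False,  # self-signed certs in our lab; verify properly in production
)
print(resp.text)
```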


Plan for the new cluster

To adapt the cluster to future requirements, we have drawn up a new cluster plan. The aim is to design the cluster for an ingest of up to 500 GB per day and to implement two-stage data tiering to reduce SAN storage costs. Sized this way, we can also retain data for longer whenever actual ingest stays below that figure. The budget allows for five Enterprise licenses, which enables a targeted distribution of RAM to the nodes that need the licensed features. In addition, we plan to run two dedicated ML nodes in the new cluster.

Data tiering calculation

1. Hot phase storage (primary storage for high performance):
  • Calculation: 500 GB ingest per day * 7 days * 2 (one replica per primary) * 1.3 (30% buffer) = ~9.1 TB (see the sizing sketch after this list)
  • Result: 10 TB for hot phase storage to provide stable capacity even at load peaks. This also leaves a buffer for critical metric data such as APM, SLO and uptime data.
2. Second phase storage (secondary storage for long-term retention):
  • Calculation: 500 GB ingest per day * 21 days * 2 (one replica per primary) * 1.3 (30% buffer) = ~27.3 TB
  • Result: 30 TB for the second tier of data tiering in order to provide large, cost-effective storage for data availability.
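For transparency, the same arithmetic as a small Python sketch. The factor of 2 is our assumption of one replica per primary shard, and 1 TB is treated as 1000 GB.

```python
# Sizing sketch for both tiers, using the numbers from the list above.
def tier_size_tb(ingest_gb_per_day: float, retention_days: int,
                 replicas: int = 1, buffer: float = 0.30) -> float:
    """Capacity needed for one tier in TB: primaries + replicas + safety buffer."""
    copies = 1 + replicas                      # primary shard plus replicas
    raw_gb = ingest_gb_per_day * retention_days * copies
    return raw_gb * (1 + buffer) / 1000        # 30% buffer, GB -> TB

hot_needed = tier_size_tb(500, 7)      # ~9.1 TB  -> we provision 10 TB
second_needed = tier_size_tb(500, 21)  # ~27.3 TB -> we provision 30 TB

print(f"Hot tier:    {hot_needed:.1f} TB")
print(f"Second tier: {second_needed:.1f} TB")
```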

Planned cluster infrastructure / questions

For the first tier, we definitely want to use SAN storage as hot storage. The open question is which solution to implement for the second tier. One option is to use an S3-capable storage appliance, distribute it across several VMs and use it as warm tier storage. The alternative would be to use searchable snapshots with an S3-based frozen tier.
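To make the two options concrete, here is a rough ILM policy sketch for the searchable-snapshot variant, pushed via the REST API. The policy name logs-tiering-draft, the repository name s3-snapshots, the rollover thresholds and the endpoint are placeholders, not final values.

```python
# Rough ILM sketch for the planned tiering, sent via the _ilm/policy REST API.
# Policy name, repository name, URL and thresholds are assumptions, not final values.
import requests

ES_URL = "https://elastic.example.internal:9200"  # hypothetical endpoint
AUTH = ("elastic", "changeme")                    # hypothetical credentials

policy = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    # Roll over daily or at ~50 GB primary shard size.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                },
            },
            # Variant A (frozen tier): mount indices as searchable snapshots
            # from an S3 repository after 7 days in the hot tier.
            "frozen": {
                "min_age": "7d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "s3-snapshots"}
                },
            },
            # Variant B (warm tier) would instead use a "warm" phase here and let
            # the default data tier allocation move shards to data_warm nodes.
            # 7 days hot + 21 days in the second tier = 28 days total retention.
            "delete": {"min_age": "28d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(
    f"{ES_URL}/_ilm/policy/logs-tiering-draft",  # hypothetical policy name
    json=policy,
    auth=AUTH,
    verify=False,
)
print(resp.status_code, resp.text)
```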

1.) Should the hot data tier be spread across more than two nodes?

2.) What are the disadvantages of a frozen tier compared to a warm tier (the hardware would be identical for both tiers)?
-> Is the data in searchable snapshots read-only?
-> Why is a replica shard not necessary for searchable snapshots?

3.) Is it advisable to run two dedicated ML nodes with only 5 ERUs (Enterprise Resource Units)?