How many nodes are required for 10 TB of data?

Data Size: 6.8 TB
Monthly Increase: 150 GB
Time Period: 18 months
Calculation:

Total increase = 150 GB/month × 18 months = 2,700 GB = 2.7 TB
Final data size after 18 months = 6.8 TB + 2.7 TB = 9.5 TB
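As a quick sanity check, the same projection as a small Python sketch (the figures are just the ones quoted above, so this is only arithmetic):

```python
# Projected data size after a period of linear growth.
# The figures are the ones quoted above; adjust to your own numbers.
current_tb = 6.8          # current data size in TB
monthly_growth_gb = 150   # expected monthly increase in GB
months = 18               # planning horizon

growth_tb = monthly_growth_gb * months / 1000   # 2700 GB -> 2.7 TB
projected_tb = current_tb + growth_tb

print(f"Projected data after {months} months: {projected_tb:.1f} TB")
# Projected data after 18 months: 9.5 TB
```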

Hi @AnushreeRaikar, welcome to our community.

Have you seen this link with some sharding strategies?

There is not necessarily any fixed limit to how much data a single Elasticsearch node can hold, so this will depend on your data, hardware and requirements around indexing throughput and query latency/concurrency. Recent versions have significantly reduced the amount of heap required, so nodes can now hold a lot more data than they used to.

Assuming that the data volumes you mention are raw and that this translates 1:1 into indexed data on disk (a simplified assumption, as this will depend on your data and mappings), it is possible that a single node with 64GB RAM and a 30GB heap could hold the full data volume.

You did not mention anything about high availability. If you want this you need to have at least one replica shard configured, which will double your data size on disk across the cluster. You also need at least 3 master-eligible nodes, which can also be data nodes. In this case you would have 19TB spread out across 3 nodes, which is about 6.3TB per node.
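As a rough sketch of that sizing, assuming one replica and three data nodes (both are assumptions to adjust to your own requirements):

```python
# Rough per-node disk estimate with replicas.
# The 9.5 TB figure is the projection above; replica and node counts
# are assumptions, not fixed requirements.
primary_data_tb = 9.5
replicas = 1        # one replica copy of every primary shard
data_nodes = 3      # three master-eligible data nodes

total_on_disk_tb = primary_data_tb * (1 + replicas)   # 19 TB
per_node_tb = total_on_disk_tb / data_nodes           # ~6.3 TB

print(f"Total on disk: {total_on_disk_tb:.1f} TB, "
      f"per node: {per_node_tb:.1f} TB")
```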

The next factor that may force you to hold less data than this per node, and therefore increase the total number of nodes in the cluster, is your requirements around query latencies and the number of concurrent queries you need to support. The number of CPU cores available and the performance of the storage used play a big part here. This is difficult to estimate up front, so you will most likely need to run tests and benchmarks using realistic hardware, data and queries.

For our cluster, we decided on the number of nodes based on the limitations/features of the Azure infrastructure we were using. There were three main considerations:

  1. Azure provided 3 availability zones, meaning that having a total of 2 replica shards (3 copies of each shard in total) combined with appropriate shard allocation awareness settings would give us maximum redundancy (a sketch of these settings follows after this list). This means that our number of nodes will be a multiple of 3.
  2. Disks can be scaled up by a factor of 2, but not scaled down. This means that we want to keep our disks small enough that we can scale the cluster incrementally by adding smaller nodes inexpensively instead of doubling the size of every disk in the cluster.
  3. How many CPUs do we need in the cluster to serve the requests (i.e. how many shards do we want on each node)?

Points 1 and 2 will depend on your infrastructure, and point 3 will depend on how you're using it.
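To illustrate point 1, here is a hedged sketch of the relevant settings. `number_of_replicas` and shard allocation awareness are real Elasticsearch settings, but the attribute name `zone`, the index name, the shard count and the endpoint below are placeholders for your own setup:

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint, no auth shown

# Index-level: two replicas, so every shard exists in three copies,
# one per availability zone (point 1 above). Index name and shard
# count are placeholders.
requests.put(
    f"{ES}/my-index",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 2}},
)

# Cluster-level: shard allocation awareness so the three copies are
# spread across zones. Each node also needs a matching attribute,
# e.g. node.attr.zone: zone-a in its elasticsearch.yml.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}},
)
```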

For sure, reindexing and index splits are nerve-wracking (for me anyway), so I would definitely investigate whether you can divide your data into different indices (maybe using data streams?).
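If data streams could fit your data, here is a minimal sketch of what setting one up looks like (the template name, index pattern and endpoint are placeholders, and data streams expect time-based documents with an `@timestamp` field):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint

# A composable index template flagged as a data stream template; writes to
# any name matching the pattern then go into a data stream whose backing
# indices can be rolled over, instead of one ever-growing index.
requests.put(
    f"{ES}/_index_template/my-logs-template",   # template name is a placeholder
    json={
        "index_patterns": ["my-logs-*"],
        "data_stream": {},
        "template": {"settings": {"number_of_replicas": 1}},
    },
)

# Data streams are append-only and expect an @timestamp field;
# documents are added with op_type=create.
requests.post(
    f"{ES}/my-logs-app/_doc?op_type=create",
    json={"@timestamp": "2024-01-01T00:00:00Z", "message": "example event"},
)
```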


I have the below indices:

Ind1: 597 MB

Ind2: 195 GB

Ind3: 3.2 TB

Ind4: 29 GB

Ind5: 235 MB

Is this the size of the indices including replicas or just the size of the primary shards?

If this is the total size of the indices including replicas, you only have around 170GB of data per node at the moment, which is very little.
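As a back-of-the-envelope sketch of that per-node figure (the data node count below is an assumption, not something stated in the thread):

```python
# Back-of-the-envelope: total index size divided by the number of data
# nodes. The node count is an assumed placeholder; plug in your own.
index_sizes_gb = [0.597, 195, 3200, 29, 0.235]   # Ind1..Ind5 from above
data_nodes = 20                                  # assumed data node count

total_gb = sum(index_sizes_gb)
print(f"Total: {total_gb / 1000:.2f} TB, "
      f"per node: {total_gb / data_nodes:.0f} GB")
# With 20 data nodes this works out to roughly 170 GB per node.
```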

Here it depends on what type of hardware you have and what the current expected performance levels are. It would help if you could provide the specification of the nodes (CPU, RAM and type of storage used) as well as the load and expected performance.

If you have very slow storage it is possible you may hit IOPS limits (and thereby affect performance) quite quickly if you start adding the amounts of data you initially talked about.

[quote="AnushreeRaikar, post:5, topic:365133"]
Ind 1 : 597 MB

Ind2 : 195 GB

Ind3 : 3.2 TB

Ind4 : 29 GB

Ind5 : 235 MB

What type of storage are you using? What type of queries are you running? Is it primarily aggregations and Kibana dashboards or is it searches returning potentially large result sets? Are you using any more advanced features that could affect performance, e.g. parent-child relationships or nested documents?

This (CPU and RAM) sounds very high given the amount of data and the number of nodes. What is the load on the cluster? How many queries per second do you need to support? What is the expected query latency?

We have 170 active users and do not expect any more to be added in the coming years.

Why is the cluster currently so large? The current data set would, if I calculate correctly, fit on 4 nodes.

As per my calculation, 9 nodes + 3 master nodes are enough for 2 years. I wanted to be 100% sure I was right.

What should the cluster look like, according to you?

You estimated a data volume of 9.5TB. If that does not include a replica you may need to double this.

As I said earlier I think the nodes look oversized. Given the light load I suspect 8 CPU cores and 64GB RAM per data node should be sufficient. I would also expect each node to be able to hold more data, likely at least double what they currently hold.

You have still not provided any details on the type of storage used, and as this is often the bottleneck it is hard to tell how much data each node can likely hold. You will probably need to test this yourself.

1 Like

gp3 is the storage used.

OK. gp3 EBS cannot compare to local SSDs, but is likely faster than local HDDs. You will need to test and see how much data you can put on each node while maintaining acceptable response times.
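If you do not want to set up a dedicated benchmarking tool such as Rally straight away, even a crude latency probe against a representative query gives a first impression (the endpoint, index name and query below are placeholders):

```python
import time
import requests

ES = "http://localhost:9200"          # placeholder endpoint
INDEX = "my-index"                    # placeholder index name
query = {"query": {"match_all": {}}}  # replace with a realistic query

# Fire a batch of identical searches and look at the latency distribution.
latencies_ms = []
for _ in range(50):
    start = time.perf_counter()
    requests.post(f"{ES}/{INDEX}/_search", json=query)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print(f"median: {latencies_ms[len(latencies_ms) // 2]:.1f} ms, "
      f"p95: {latencies_ms[int(len(latencies_ms) * 0.95)]:.1f} ms")
```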
