How many nodes are required for 10 TB of data?

Data Size: 6.8 TB
Monthly Increase: 150 GB
Time Period: 18 months
Calculation:

Total increase = 150 GB/month × 18 months = 2,700 GB = 2.7 TB
Final data size after 18 months = 6.8 TB + 2.7 TB = 9.5 TB
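As a quick sanity check, the same projection as a small Python sketch (the figures are just the ones quoted above, so this is only arithmetic):

```python
# Projected data size after a period of linear growth.
# The figures are the ones quoted above; adjust to your own numbers.
current_tb = 6.8          # current data size in TB
monthly_growth_gb = 150   # expected monthly increase in GB
months = 18               # planning horizon

growth_tb = monthly_growth_gb * months / 1000   # 2700 GB -> 2.7 TB
projected_tb = current_tb + growth_tb

print(f"Projected data after {months} months: {projected_tb:.1f} TB")
# Projected data after 18 months: 9.5 TB
```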

Hi @AnushreeRaikar, welcome to our community.

Have you seen this link with some sharding strategies?

There is not necessarily any fixed limit to how much data a single Elasticsearch node can hold, so this will depend on your data, hardware and requirements around indexing throughput and query latency/concurrency. Recent versions have significantly reduced the amount of heap required, so nodes can now hold a lot more data than they used to.

Assuming that the data volumes you mention are raw and that this translates 1:1 into indexed data on disk (a simplified assumption, as this will depend on your data and mappings), it is possible that a single node with 64GB RAM and a 30GB heap could hold the full data volume.

You did not mention anything about high availability. If you want this you need to have at least one replica shard configured, which will double your data size on disk across the cluster. You also need at least 3 master-eligible nodes, which can also be data nodes. In this case you would have 19TB spread out across 3 nodes, which is about 6.3TB per node.
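As a rough sketch of that sizing, assuming one replica and three data nodes (both are assumptions to adjust to your own requirements):

```python
# Rough per-node disk estimate with replicas.
# The 9.5 TB figure is the projection above; replica and node counts
# are assumptions, not fixed requirements.
primary_data_tb = 9.5
replicas = 1        # one replica copy of every primary shard
data_nodes = 3      # three master-eligible data nodes

total_on_disk_tb = primary_data_tb * (1 + replicas)   # 19 TB
per_node_tb = total_on_disk_tb / data_nodes           # ~6.3 TB

print(f"Total on disk: {total_on_disk_tb:.1f} TB, "
      f"per node: {per_node_tb:.1f} TB")
```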

The next factor that may force you to hold less data than this per node, and therefore increase the total number of nodes in the cluster, is your requirements around query latencies and the number of concurrent queries you need to support. The number of CPU cores available and the performance of the storage used play a big part here. This is difficult to estimate up front, so you will most likely need to run tests and benchmarks using realistic hardware, data and queries.

For our cluster, we decided on the number of nodes based on the limitations/features of the Azure infrastructure we were using. There were three main considerations:

  1. Azure provided 3 availability zones, meaning that having a total of 2 replica shards (3 copies of each shard in total) combined with appropriate shard allocation awareness settings would give us maximum redundancy (a sketch of these settings follows after this list). This means that our number of nodes will be a multiple of 3.
  2. Disks can be scaled up by a factor of 2, but not scaled down. This means that we want to keep our disks small enough that we can scale the cluster incrementally by adding smaller nodes inexpensively instead of doubling the size of every disk in the cluster.
  3. How many CPUs do we need in the cluster to serve the requests (i.e. how many shards do we want on each node)?

Points 1 and 2 will depend on your infrastructure, and point 3 will depend on how you're using it.
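To illustrate point 1, here is a hedged sketch of the relevant settings. `number_of_replicas` and shard allocation awareness are real Elasticsearch settings, but the attribute name `zone`, the index name, the shard count and the endpoint below are placeholders for your own setup:

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint, no auth shown

# Index-level: two replicas, so every shard exists in three copies,
# one per availability zone (point 1 above). Index name and shard
# count are placeholders.
requests.put(
    f"{ES}/my-index",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 2}},
)

# Cluster-level: shard allocation awareness so the three copies are
# spread across zones. Each node also needs a matching attribute,
# e.g. node.attr.zone: zone-a in its elasticsearch.yml.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}},
)
```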

For sure, reindexing and index splits are nerve-wracking (for me anyway), so I would definitely investigate whether you can divide your data into different indices (maybe using data streams?).
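If data streams could fit your data, here is a minimal sketch of what setting one up looks like (the template name, index pattern and endpoint are placeholders, and data streams expect time-based documents with an `@timestamp` field):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint

# A composable index template flagged as a data stream template; writes to
# any name matching the pattern then go into a data stream whose backing
# indices can be rolled over, instead of one ever-growing index.
requests.put(
    f"{ES}/_index_template/my-logs-template",   # template name is a placeholder
    json={
        "index_patterns": ["my-logs-*"],
        "data_stream": {},
        "template": {"settings": {"number_of_replicas": 1}},
    },
)

# Data streams are append-only and expect an @timestamp field;
# documents are added with op_type=create.
requests.post(
    f"{ES}/my-logs-app/_doc?op_type=create",
    json={"@timestamp": "2024-01-01T00:00:00Z", "message": "example event"},
)
```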


I have the below indices:

Ind1: 597 MB

Ind2: 195 GB

Ind3: 3.2 TB

Ind4: 29 GB

Ind5: 235 MB

Is this the size of the indices including replicas or just the size of the primary shards?

If this is the total size of the indices including replicas, you only have around 170GB of data per node at the moment, which is very little.
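As a back-of-the-envelope sketch of that per-node figure (the data node count below is an assumption, not something stated in the thread):

```python
# Back-of-the-envelope: total index size divided by the number of data
# nodes. The node count is an assumed placeholder; plug in your own.
index_sizes_gb = [0.597, 195, 3200, 29, 0.235]   # Ind1..Ind5 from above
data_nodes = 20                                  # assumed data node count

total_gb = sum(index_sizes_gb)
print(f"Total: {total_gb / 1000:.2f} TB, "
      f"per node: {total_gb / data_nodes:.0f} GB")
# With 20 data nodes this works out to roughly 170 GB per node.
```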

Here it depends on what type of hardware you have and what the current expected performance levels are. It would help if you could provide the specification of the nodes (CPU, RAM and type of storage used) as well as the load and expected performance.

If you have very slow storage it is possible you may hit IOPS limits (and thereby affect performance) quite quickly if you start adding the amounts of data you initially talked about.

[quote="AnushreeRaikar, post:5, topic:365133"]
Ind 1 : 597 MB

Ind2 : 195 GB

Ind3 : 3.2 TB

Ind4 : 29 GB

Ind5 : 235 MB

What type of storage are you using? What type of queries are you running? Is it primarily aggregations and Kibana dashboards or is it searches returning potentially large result sets? Are you using any more advanced features that could affect performance, e.g. parent-child relationships or nested documents?

This (CPU and RAM) sounds very high given the amount of data and the number of nodes. What is the load on the cluster? How many queries per second do you need to support? What is the expected query latency?

We have 170 active users and do not expect any more to be added in the coming years.

Why is the cluster currently so large? The current data set would, if I calculate correctly, fit on 4 nodes.

As per my calculation, 9 nodes + 3 master nodes are enough for 2 years. I wanted to be 100% sure I was right.

What should the cluster look like, according to you?

You estimated a data volume of 9.5TB. If that does not include a replica you may need to double this.

As I said earlier I think the nodes look oversized. Given the light load I suspect 8 CPU cores and 64GB RAM per data node should be sufficient. I would also expect each node to be able to hold more data, likely at least double what they currently hold.

You have still not provided any details on the type of storage used, and as this is often the bottleneck it is hard to tell how much data each node can likely hold. You will probably need to test this yourself.

1 Like

gp3 is the storage used.

OK. gp3 EBS cannot compare to local SSDs, but is likely faster than local HDDs. You will need to test and see how much data you can put on each node while maintaining acceptable response times.
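If you do not want to set up a dedicated benchmarking tool such as Rally straight away, even a crude latency probe against a representative query gives a first impression (the endpoint, index name and query below are placeholders):

```python
import time
import requests

ES = "http://localhost:9200"          # placeholder endpoint
INDEX = "my-index"                    # placeholder index name
query = {"query": {"match_all": {}}}  # replace with a realistic query

# Fire a batch of identical searches and look at the latency distribution.
latencies_ms = []
for _ in range(50):
    start = time.perf_counter()
    requests.post(f"{ES}/{INDEX}/_search", json=query)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print(f"median: {latencies_ms[len(latencies_ms) // 2]:.1f} ms, "
      f"p95: {latencies_ms[int(len(latencies_ms) * 0.95)]:.1f} ms")
```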
