Data Size: 6.8 TB
Monthly Increase: 150 GB
Time Period: 18 months

Calculation:
Total Increase in Data: 150 GB/month × 18 months = 2700 GB = 2.7 TB
Total Data Size After 18 Months: 6.8 TB + 2.7 TB = 9.5 TB
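For reference, the same projection as a quick Python sketch (the figures are the ones quoted above):

```python
# Project the total data size from the figures above.
current_tb = 6.8          # current data size in TB
monthly_growth_gb = 150   # monthly increase in GB
months = 18

growth_tb = monthly_growth_gb * months / 1000   # 2700 GB = 2.7 TB
final_tb = current_tb + growth_tb               # 6.8 TB + 2.7 TB = 9.5 TB
print(f"Projected size after {months} months: {final_tb:.1f} TB")
```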
Hi @AnushreeRaikar, welcome to our community.
Have you seen this link with some sharding strategies?
There is not necessarily any fixed limit to how much data a single Elasticsearch node can hold, so this will depend on your data, hardware and requirements around indexing throughput and query latency/concurrency. Recent versions have significantly reduced the amount of heap required, so nodes can now hold a lot more data than they used to.
Assuming that the data volumes you mention are raw and that this translates 1:1 into indexed data on disk (a simplified assumption, as this will depend on your data and mappings), it is possible that a single node with 64GB RAM and a 30GB heap could hold the full data volume.
You did not mention anything about high availability. If you want this you need to have at least one replica shard configured, which will double your data size on disk across the cluster. You also need at least 3 master-eligible nodes, which can also be data nodes. In this case you would have 19TB spread out across 3 nodes, which is roughly 6.3TB per node.
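If you do go for high availability, the replica count is just an index setting. Below is a minimal sketch using the REST API through Python's requests library; the endpoint and index name are placeholders, and the per-node figure in the comment is simply the rough arithmetic from the paragraph above:

```python
import requests

ES_URL = "http://localhost:9200"   # placeholder endpoint
INDEX = "my-index"                 # placeholder index name

# One replica per primary shard doubles the on-disk footprint across the cluster.
requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json={"index": {"number_of_replicas": 1}},
).raise_for_status()

# ~9.5 TB of primaries plus one replica spread over 3 data nodes:
print(f"~{9.5 * 2 / 3:.1f} TB per node")   # roughly 6.3 TB
```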
The next factor that may force you to hold less data than this per node, and therefore increase the total number of nodes in the cluster, is your requirements around query latencies and the number of concurrent queries you need to support. The number of CPU cores available and the performance of the storage used play a big part here. This is difficult to estimate up front, so you will most likely need to run tests and benchmarks using realistic hardware, data and queries.
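As a very rough starting point before reaching for a dedicated tool such as Rally, something like the sketch below can give a feel for query latency under concurrency; the endpoint, index name and query body are placeholders that you would swap for realistic ones:

```python
import concurrent.futures
import time

import requests

ES_URL = "http://localhost:9200"                  # placeholder endpoint
INDEX = "my-index"                                # placeholder index name
QUERY = {"query": {"match_all": {}}, "size": 10}  # replace with a realistic query

def one_search(_):
    """Run a single search and return its latency in milliseconds."""
    start = time.perf_counter()
    requests.post(f"{ES_URL}/{INDEX}/_search", json=QUERY).raise_for_status()
    return (time.perf_counter() - start) * 1000

# Fire 100 searches with 10 concurrent clients and report the latency spread.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    latencies = sorted(pool.map(one_search, range(100)))

print(f"p50={latencies[49]:.0f} ms  p95={latencies[94]:.0f} ms")
```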
For our cluster, we decided on the number of nodes based on the limitations/features of the Azure infrastructure we were using. There were three main considerations.
Points 1 and 2 will depend on your infrastructure, and point 3 will depend on how you're using it.
For sure, reindexing and index splits are nerve-wracking (for me anyway), so I would definitely investigate whether you can divide your data into different indices (maybe using data streams?).
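If the data is time-based, a data stream keeps it split across many smaller backing indices instead of one 3TB+ index. A minimal sketch (the template name, pattern and endpoint are placeholders):

```python
import requests

ES_URL = "http://localhost:9200"   # placeholder endpoint

# A composable index template with a data_stream section: writes that match
# the pattern go into time-based backing indices managed by Elasticsearch.
requests.put(
    f"{ES_URL}/_index_template/logs-template",   # placeholder template name
    json={
        "index_patterns": ["logs-myapp-*"],      # placeholder pattern
        "data_stream": {},
        "template": {"settings": {"number_of_shards": 1}},
    },
).raise_for_status()

# Create the data stream explicitly (it would also be created on first write).
requests.put(f"{ES_URL}/_data_stream/logs-myapp-default").raise_for_status()
```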
I have the indices below:
Ind1: 597 MB
Ind2: 195 GB
Ind3: 3.2 TB
Ind4: 29 GB
Ind5: 235 MB
Is this the size of the indices including replicas or just the size of the primary shards?
If this is the total size of the indices including replicas you only have around 170GB per node at the moment, which is very little.
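One way to check is the _cat indices API, which reports the primary store size and the total size (primaries plus replicas) side by side; a quick sketch against a placeholder endpoint:

```python
import requests

ES_URL = "http://localhost:9200"   # placeholder endpoint

# pri.store.size = primary shards only, store.size = primaries plus replicas.
resp = requests.get(
    f"{ES_URL}/_cat/indices",
    params={"v": "true", "h": "index,pri,rep,pri.store.size,store.size", "bytes": "gb"},
)
print(resp.text)
```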
Here it depends on what type of hardware you have and what the current expected performance levels are. It would help if you could provide the specification of the nodes (CPU, RAM and type of storage used) as well as the load and expected performance.
If you have very slow storage it is possible you may hit IOPS limits (and thereby affect performance) quite quickly if you start adding the amounts of data you initially talked about.
[quote="AnushreeRaikar, post:5, topic:365133"]
Ind1: 597 MB
Ind2: 195 GB
Ind3: 3.2 TB
Ind4: 29 GB
Ind5: 235 MB
[/quote]
What type of storage are you using? What type of queries are you running? Is it primarily aggregations and Kibana dashboards or is it searches returning potentially large result sets? Are you using any more advanced features that could affect performance, e.g. parent-child relationships or nested documents?
This (CPU and RAM) sounds very high given the amount of data and the number of nodes. What is the load on the cluster? How many queries per second do you need to support? What is the expected query latency?
We have 170 active users and do not expect any more to be added over the next few years.
Why is the cluster currently so large? If I calculate correctly, the current data set would fit on 4 nodes.
As per my calculation, 9 nodes + 3 master nodes should be enough for 2 years. I wanted to be 100% sure that I was right.
What cluster size would you recommend?
You estimated a data volume of 9.5TB. If that does not include a replica you may need to double this.
As I said earlier I think the nodes look oversized. Given the light load I suspect 8 CPU cores and 64GB RAM per data node should be sufficient. I would also expect each node to be able to hold more data, likely at least double what they currently hold.
You have still not provided any details on the type of storage used, and as this often is the bottleneck it is hard to tell how much data each node likely can hold. You probably need to test yourself.
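Once you have a tested per-node figure, the node count follows mechanically. A back-of-the-envelope sketch; the per-node capacity below is a placeholder assumption to be replaced by your own benchmark results, not a recommendation:

```python
import math

projected_primaries_tb = 9.5   # from the earlier growth calculation
replicas = 1                   # one replica for high availability
per_node_capacity_tb = 3.0     # ASSUMPTION: replace with your benchmarked figure

total_tb = projected_primaries_tb * (1 + replicas)       # 19 TB on disk
data_nodes = math.ceil(total_tb / per_node_capacity_tb)  # e.g. 7 nodes at 3 TB/node
print(f"{total_tb:.0f} TB total -> {data_nodes} data nodes "
      f"at {per_node_capacity_tb} TB per node")
```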
gp3 (AWS EBS) is the storage used.
OK. gp3 EBS cannot compare to local SSDs but is likely faster than local HDDs. You will need to test to see how much data you can put on each node while maintaining acceptable response times.