Your 80 GB of data per day may not be the amount Elasticsearch actually ends up writing to disk; that depends on your mapping, sharding and other factors. So the first thing I would do is run a simulation, indexing a full day's worth of data with the mapping and index settings you intend to use in production. That will give you a much better grasp of how much disk you'll need.
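As a rough sketch of how to read back the result of such a simulation (assuming Python with the requests library, a test cluster at localhost:9200 and a test index called logs-sizing-test, which are all placeholders for your own setup), you can query the _cat/indices API for the real on-disk sizes:

```python
# Sketch: after indexing one full day's worth of data into a test index,
# ask Elasticsearch how much disk it actually uses.
import requests

ES_URL = "http://localhost:9200"   # assumption: local test cluster, no auth
INDEX = "logs-sizing-test"         # assumption: name of your test index

resp = requests.get(
    f"{ES_URL}/_cat/indices/{INDEX}",
    params={
        "format": "json",
        "bytes": "gb",
        "h": "index,pri,rep,docs.count,pri.store.size,store.size",
    },
)
resp.raise_for_status()

for row in resp.json():
    print(
        f"{row['index']}: primaries={row['pri.store.size']} GB, "
        f"total incl. replicas={row['store.size']} GB, "
        f"docs={row['docs.count']}"
    )
```

Comparing pri.store.size against the raw 80 GB you ingest tells you how much your mapping inflates or compresses the data on disk.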
10 data nodes with 512 GB of disk each give you roughly 5 TB of disk space for data in the cluster, which doesn't sound like enough for the use case you've described. Consider this:
If you're going to use 2 replicas per primary shard, you need to triple the disk space compared to what you store in the primary shards alone.
As an example, let's say you actually save 80 GB of primary data to disk every day; then you also save 2 x 80 = 160 GB of replica data per day, for a total of 240 GB per day. With 30 days per month that ends up at 240 GB x 30 = 7200 GB, which is about 2 TB more than what you have available in a cluster of 10 data nodes. This clearly won't work.
Ideally you should never use more than 70-80% of the disk space, because going above that leaves you no room for merging large shard segments or for re-indexing when you need to change a mapping. So if you aim for 7200 GB of data per month I would recommend a cluster with at least 8000 GB of total disk space, as that would leave 800 GB, or 10%, free when the cluster holds one month of data (and preferably more, to stay within the 70-80% guideline). In that case you'll need 8000 / 512 = 15.6, so at least 16 data nodes of 512 GB.
Alternatively, if you reduce the replica count to just 1, you'll need far less disk space: 160 GB x 30 = 4800 GB for 30 days.
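To make it easy to replay this arithmetic with your own numbers, here is a small sketch in plain Python; the inputs are just the assumptions from this thread (80 GB of primaries per day, 30 days of retention, 512 GB per data node, at least 10% disk kept free):

```python
# Back-of-the-envelope sizing that reproduces the arithmetic above.
# Replace the inputs with figures from your own indexing simulation.
def disk_needed_gb(primary_gb_per_day, replicas, retention_days, max_fill=0.9):
    """Return (GB stored after retention_days, cluster disk needed at max_fill)."""
    per_day = primary_gb_per_day * (1 + replicas)   # primaries plus replica copies
    stored = per_day * retention_days
    return stored, stored / max_fill

node_disk_gb = 512  # disk per data node, as in the original question

for replicas in (2, 1):
    stored, needed = disk_needed_gb(80, replicas, 30)
    nodes = -(-needed // node_disk_gb)              # ceiling division
    print(f"replicas={replicas}: {stored:.0f} GB stored over 30 days, "
          f"~{needed:.0f} GB cluster disk with 10% free, "
          f"so at least {nodes:.0f} nodes of {node_disk_gb} GB")
```

Also note that number_of_replicas is a dynamic index setting, so you can change it on a live index through the update index settings API if disk becomes tight later.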
@Bernt_Rostad Thanks for this explanation. Actually, the data size is 80 GB per day including replication; my bad, I did not mention that. So we are allocating 50% more disk space.