Setting up an Elasticsearch cluster for the first time here. I need some advice on creating my index.
This index will store information about emails. The plan is to retain about 10 years' worth of message data. I did some analysis on the logs, and the largest month we have had is around 250GB compressed (gzip, so I think that's DEFLATE), with the average being around 100GB compressed. Taking the high end, that's 250GB x 12 = 3TB/year, or 30TB for 10 years. Now, I'm probably not going to index everything here, and we're still working out exactly how we'll index this data. But I've learned that certain index settings are static and cannot be changed, so I was hoping to get the shard count right up front.
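To sanity-check my own arithmetic, here it is in Python. (Note these are compressed log sizes; the actual on-disk index size could be quite different once fields are analyzed and indexed.)

```python
# Back-of-envelope retention estimate from compressed log sizes.
# Assumption: worst case treats every month like the largest one observed.
worst_month_gb = 250   # largest compressed month seen in the logs
avg_month_gb = 100     # average compressed month
years = 10

worst_total_tb = worst_month_gb * 12 * years / 1000   # 30.0 TB over 10 years
avg_total_tb = avg_month_gb * 12 * years / 1000       # 12.0 TB over 10 years

print(worst_total_tb, avg_total_tb)  # 30.0 12.0
```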
I currently have 10 data nodes in my cluster, each with 64GB of RAM, a 25GB heap, and 5TB of attached storage (50TB total). I also have 4 ingest nodes, 2 coordinating nodes, and 3 master nodes.
So here are some questions I have:
- Does it make sense to divide the data into multiple indices, perhaps by month or year? Is there a performance benefit?
- I've taken a look at https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster. One thing it mentions is "For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size." This suggests to me that a single index for all of this data might be a bad idea: with 30TB of total email data and a 40GB shard size, that's 30,000GB / 40GB = 750 shards.
- One thing I haven't taken into account yet is replica shards. If I have 50TB of total storage and 30TB of data (again, worst case), then even one replica would double the footprint to 60TB, so I would need more disk space on my data nodes, right? Also, how many replicas are best for production?
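On the first question, what I had in mind for monthly indices is a naming scheme along these lines (the `emails-YYYY.MM` pattern is just my own assumption, nothing we've settled on):

```python
# Sketch of a one-index-per-month naming scheme (hypothetical pattern).
from datetime import date

def index_for(d: date) -> str:
    """Return the monthly index name a document dated `d` would go to."""
    return f"emails-{d:%Y.%m}"

print(index_for(date(2024, 3, 15)))  # emails-2024.03
```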
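Here's the shard arithmetic behind the second bullet, comparing one big index against monthly ones (assuming the 30TB worst case and the 40GB end of the guidance):

```python
# Shard counts at the upper end of the 20-40GB guidance (primaries only).
import math

total_gb = 30_000      # worst-case 10-year total
shard_gb = 40          # upper end of the blog post's guidance
data_nodes = 10

one_big_index = math.ceil(total_gb / shard_gb)
print(one_big_index)                # 750 shards
print(one_big_index / data_nodes)   # 75.0 shards per data node

# Monthly indices instead: even the worst month stays manageable.
worst_month_gb = 250
print(math.ceil(worst_month_gb / shard_gb))  # 7 shards for the biggest month
```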
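And the storage math behind the replica question (assuming the 30TB worst case against my 50TB of attached storage):

```python
# Raw storage needed as the replica count grows.
primary_tb = 30    # worst-case primary data
cluster_tb = 50    # 10 data nodes x 5TB each

for replicas in (0, 1, 2):
    needed_tb = primary_tb * (1 + replicas)
    status = "fits" if needed_tb <= cluster_tb else "over capacity"
    print(f"{replicas} replica(s): {needed_tb}TB -> {status}")
# Even 1 replica needs 60TB, which is over my current 50TB.
```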
Please feel free to ask additional questions about my setup. Sorry if some of my assumptions are off; I'm still learning, so I appreciate your valuable input.