Setting up an Elasticsearch cluster for the first time here. I need some advice on creating my index.
This index will store information about emails. The plan is to retain about 10 years' worth of message data. I performed some analysis on the logs, and it looks like the largest month we have had is around 250 GB compressed (gzip, so I think that's DEFLATE), with the average being around 100 GB compressed. If every month were on the high end, that's 250 GB × 12 = 3 TB/year, or 30 TB for 10 years. Now, I'm probably not going to index everything here, and we're still working out exactly how we'll index this data. But I've learned that certain index settings, like the number of primary shards, are static once the index is created, so I was hoping to get the shard count right up front.
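For concreteness, here's the kind of create-index call I'm talking about (the index name and the counts are placeholders I made up, but as I understand it `number_of_shards` is fixed once the index exists, short of a `_split`/`_shrink`):

```
PUT /emails-2024-01
{
  "settings": {
    "index": {
      "number_of_shards": 2,     // static after creation
      "number_of_replicas": 1    // can be changed at any time
    }
  }
}
```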
I currently have 10 data nodes in my cluster, each with 64 GB of RAM, a 25 GB heap, and 5 TB of attached storage (50 TB total). I also have 4 ingest nodes, 2 coordinating nodes, and 3 master nodes.
So here are some questions I have:
Does it make sense to divide the data into multiple indices? Perhaps by month or year? Would there be a performance benefit?
I've taken a look at https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster. One thing it mentions is: "For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size." This suggests to me that one index for all of this data might be a bad idea: with 30 TB of total email data and a 40 GB shard size, that's 30,000 GB / 40 GB = 750 shards.
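If I do split by month or year, I assume the way to keep settings consistent across indices is an index template, something like this sketch (the `emails-*` pattern and the shard count are placeholders, not a recommendation; on 7.8+ this is the composable `_index_template` API, older clusters use the legacy `_template` endpoint):

```
PUT _index_template/emails
{
  "index_patterns": ["emails-*"],
  "template": {
    "settings": {
      "index": {
        "number_of_shards": 2,
        "number_of_replicas": 1
      }
    }
  }
}
```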
One thing I haven't taken into account yet is replica shards. If I have 50 TB total and 30 TB of data (again, worst case), then with even one replica that's 30 TB × 2 = 60 TB on disk, which already exceeds my 50 TB, so I would need more disk space on my data nodes, right? Also, how many replica shards is best for production?
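(For what it's worth, I've been checking per-node disk headroom with `_cat/allocation`; the column list here is just the subset I care about:)

```
GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent
```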
Please feel free to ask additional questions about my setup. I'm sorry if some of my assumptions are off; I'm still learning, so I appreciate your valuable input.
Let's divide by 4 (weekly indices): ~62 GB per index (index_name-year-01 to index_name-year-52).
Give each index two shards, so each shard is ~31 GB (well within the limit).
That's 52 indices × 2 = 104 primary shards, plus 104 replicas, per year.
Over 10 years you will have 1,040 primary shards and 1,040 replicas. All doable.
This also makes it easy to back up an index and then remove it. For now you're saying you need 10 years, but maybe after a few years the business requirement changes and you can remove older data; dropping a whole index makes that easy.
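Once an index is retired, removing it is a single call, e.g. with the naming scheme above:

```
DELETE /index_name-2015-01
```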
So using the example that elasticforme gave, what would be the benefits/drawbacks of using 4 shards (or 8, or 100) per weekly index vs. 2? Can you help me understand the tradeoff? I think that's really the meat of the question.
More shards use more memory, and the cluster has to manage all of them, so more is not better.
Shards bigger than ~30 GB are not good either.
And lots of very small shards won't give you any extra speed, so that's not good either.
TL;DR - it depends: on your ingest rate, your query rate, the sorts of queries you run, and the size of your documents.
I would start with a ~40 GB shard size and go from there. It may make sense to increase the shard size for archived data, for example. I would also strongly recommend using ILM (index lifecycle management).
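A minimal ILM sketch along those lines, assuming a reasonably recent cluster; the policy name, the 40gb rollover target, and the 10-year delete age are illustrative (`max_primary_shard_size` needs 7.13+, on older versions use `max_size`):

```
PUT _ilm/policy/emails-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "40gb" }
        }
      },
      "delete": {
        "min_age": "3650d",
        "actions": { "delete": {} }
      }
    }
  }
}
```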
You can't directly specify shard size, right? It falls out of the number of shards and how the data in the index is distributed across them, correct? So a 100 GB index with 4 shards ≈ 25 GB/shard?
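And I'm guessing I'd verify the actual per-shard sizes with something like this (the index pattern is hypothetical):

```
GET _cat/shards/emails-*?v&h=index,shard,prirep,store
```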