we are planning to create a Prod Elasticsearch cluster of "2" data nodes and "3" master nodes. we expect to have total 22 M documents to index where we have one single index. what is the optimal number/size of shards to use? by the way does master nodes counts toward #of shards or only data nodes?
Master nodes can be small and do not store any datas. Number of shards for your index depends of the total size of your 22M documents after indexation. My rule of tumb is ~10Go per shard.
Thanks for providing the info. doc size varies in terms of kb (i.e. 10kb - 400kb). the estimated total storage we expect is 4TB. so in this case should we calculate # of shards based on data nodes ? for example if we have 3 master nodes and 2 data nodes, so in this case can we consider # of shards 2* 2 = 4 shards?
You should try to index some documents and look at the size, because it depends on what type of datas you are going to index (integer, string, keywords, synonymes, etc...)
For exemple, I have a 130M documents indexed having a size of 32Go. Number of shards in my case si 8 primaries. So each shard is about 4Go.
With 2 data nodes, If you have 1 index with 8 shards, each node will hold 4 shards. If you add 1 replica, they will hold 8 shards. Make some tests.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.