We are planning to have one index that will host ~150TB of data. The usual suggestion is one shard per 50GB, so I may end up with 3,000 shards!
Does ES really work with 3,000 shards on one index? We don't have a problem with resources like cores, memory, and disk; we can scale to any number of Elasticsearch nodes.
All we are concerned about is: does ES support this scale? Are there any standard guidelines for the number of shards allowed per index?
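For reference, here is roughly how we arrived at that number and how we were planning to create the index. This is just a sketch using the Python elasticsearch client (8.x style); the endpoint and index name are placeholders.

```python
from elasticsearch import Elasticsearch

# Capacity-planning numbers from our estimate (placeholder figures).
TOTAL_DATA_GB = 150 * 1000            # ~150 TB expected over two years
TARGET_SHARD_SIZE_GB = 50             # the commonly recommended shard size
num_shards = TOTAL_DATA_GB // TARGET_SHARD_SIZE_GB   # -> 3000 primary shards

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

# One index holding everything, with 3000 primary shards.
es.indices.create(
    index="customer-docs",            # hypothetical index name
    settings={
        "number_of_shards": num_shards,
        "number_of_replicas": 1,
    },
)
```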
Please help us with this so we can make a better decision for our product.
Can you explain your use case a bit more? What would be the reason for putting everything into a single index? How are your queries structured?
In general, the 50GB-per-shard rule is not a strict rule but a recommendation. Before ending up with thousands of shards, it makes sense to bend it.
A shard is just a Lucene index. It does not matter whether you query 3,000 shards across 3,000 indices or 3,000 shards within a single index; the same work has to be done in the background. That's one of the reasons it makes sense to keep the number of shards low.
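To make that concrete, here is a small sketch (Python elasticsearch client, 8.x style, against a hypothetical local cluster with throwaway index names). A search over one index with four shards and a search over four single-shard indices both fan out to four shard-level requests:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

# Layout A: one index with 4 primary shards.
es.indices.create(
    index="docs-single",
    settings={"number_of_shards": 4, "number_of_replicas": 0},
)

# Layout B: four indices with 1 primary shard each.
for i in range(4):
    es.indices.create(
        index=f"docs-part-{i}",
        settings={"number_of_shards": 1, "number_of_replicas": 0},
    )

# Both searches hit 4 shards; the response reports this under "_shards".
resp_a = es.search(index="docs-single", query={"match_all": {}})
resp_b = es.search(index="docs-part-*", query={"match_all": {}})
print(resp_a["_shards"]["total"], resp_b["_shards"]["total"])   # 4 4
```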
Basically, we have one index where customers keep uploading documents of about 1MB each on average. We are expecting almost 150TB of data within 2 years. We cannot distribute that across many indices and have to keep only one index, and we keep this data forever, until the customer deletes their files.
We just tried a run with one index of 1,800 shards and ingested ~6TB of data. Now we see that we cannot go above 15 searches/sec. We have 6 data nodes with 8 cores each. With just 15 searches/sec, the search threads are hitting their maximum, and according to the Elasticsearch stats the query rate is touching 30K/sec, which is huge for just 15 searches/sec.
That is why we are trying to find the optimal number of shards to keep.
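Here is the back-of-the-envelope arithmetic behind those numbers (plain Python; the thread-pool size uses what I believe is the default formula for the search pool, assuming we have not overridden thread_pool.search.size):

```python
# Rough fan-out arithmetic for our test run.
searches_per_sec = 15
shards_in_index = 1800
data_nodes = 6
cores_per_node = 8

# Every top-level search against the index fans out to all of its shards.
shard_queries_per_sec = searches_per_sec * shards_in_index
print(shard_queries_per_sec)            # 27000 -- close to the ~30K/sec we see

# Assumed default size of the 'search' thread pool per node:
# int(cores * 3 / 2) + 1
search_threads_per_node = (cores_per_node * 3) // 2 + 1
total_search_threads = data_nodes * search_threads_per_node
print(search_threads_per_node, total_search_threads)   # 13 per node, 78 total

# ~27,000 shard-level queries/sec spread over ~78 search threads is why the
# search pool saturates at such a low top-level search rate.
```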