Quick description: I'm dealing with a scenario of over 10 billion records, requiring storage at the petabyte level. Following the Elasticsearch official best practices, which recommend keeping each shard between 20 and 50 GB (let's assume 40 GB for the calculation), we might end up needing around 10,000 shards. I'm concerned that having so many shards could lead to issues.
We are using the routing parameter, meaning each request is directed to a single shard for querying, so in theory query latency should not be affected by the total number of shards (see the sketch after the two options below). I'm therefore considering two approaches:
1. Split the data into multiple indexes, for example 100 indexes with 100 shards each. This might increase implementation and maintenance costs due to the larger number of indexes.
2. Keep a single index with 10,000 shards, which seems simpler to implement. However, I'm unsure about the potential impact of such a high shard count.
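For context, our access pattern looks roughly like this. This is a minimal sketch using the official Python client (8.x style); the index name, routing key, and document shape are placeholders, not our real schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Write path: every document carries a routing key (here: user_id),
# so it always lands on the single shard that key hashes to.
es.index(
    index="records",          # placeholder index name
    id="doc-1",
    routing="user-42",        # routing key -> one shard
    document={"user_id": "user-42", "payload": "..."},
)

# Read path: passing the same routing key means the query goes to
# one shard instead of fanning out to all 10,000.
resp = es.search(
    index="records",
    routing="user-42",
    query={"term": {"user_id": "user-42"}},
)
```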
Which strategy do you think is more reasonable? I would appreciate any advice or suggestions. Thank you.
If you are looking at petabyte-scale data it may make sense to split it across a number of large clusters instead of a single very large cluster. There is no defined maximum cluster size as far as I know, but you are likely to hit scalability issues at some point if you try to scale a single cluster out indefinitely. I do not know exactly when this will occur, as it depends on the version of Elasticsearch you are using as well as the specifics of the use case. As you are indexing, updating and querying by routing, you should be able to route to specific clusters in your application logic as well. I would therefore recommend splitting the data into multiple indices to support this. If you want to query all the data you can still do so by setting up cross cluster search.
Even if you do not want to go down the path of multiple clusters at this point, I would recommend having multiple indices so you do not need complex reindexing if you need to split into multiple clusters in the future.
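As a rough illustration of both ideas (routing to a specific cluster in application logic, and cross cluster search for the occasional query over everything), here is a sketch with the Python client. The cluster endpoints, the `cluster_1` remote alias, and the bucketing scheme are assumptions, not a prescribed setup; cross cluster search also requires `cluster.remote.<alias>.seeds` to be configured on the coordinating cluster:

```python
import zlib
from elasticsearch import Elasticsearch

# One client per physical cluster; application logic picks the cluster
# from the routing key by hashing it into N buckets.
clusters = [
    Elasticsearch("http://cluster-0.example.com:9200"),
    Elasticsearch("http://cluster-1.example.com:9200"),
]

def cluster_for(routing_key: str) -> Elasticsearch:
    # crc32 is stable across processes (unlike Python's built-in hash),
    # so the same key always maps to the same cluster.
    return clusters[zlib.crc32(routing_key.encode()) % len(clusters)]

# A routed query hits exactly one cluster (and one shard within it).
resp = cluster_for("user-42").search(
    index="records",
    routing="user-42",
    query={"term": {"user_id": "user-42"}},
)

# For the rare "query everything" case, cross cluster search lets the
# coordinating cluster address a remote cluster's index by alias.
resp_all = clusters[0].search(
    index="records,cluster_1:records",
    query={"match_all": {}},
)
```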
Not to totally contradict what Christian says (there certainly are limits to how large a single cluster can be), but 10k shards is not a particularly worrying number in modern Elasticsearch.
IIRC there's a limit of something like 1000 or 1024 shards per index. Though I'm not aware of anything bad that happens at that level, having more indices might give you more flexibility, e.g. so you can add replicas to some hot subset of your data.
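To illustrate that flexibility point: with many smaller indices you can tune settings per index, for instance adding replicas only to the indices that are actually hot. A sketch with the Python client; the index names, shard counts, and replica values are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 100 indices x 100 shards instead of one 10,000-shard index;
# each index stays well under the ~1024 shards-per-index cap.
for i in range(100):
    es.indices.create(
        index=f"records-{i:03d}",
        settings={
            "number_of_shards": 100,
            "number_of_replicas": 1,
        },
    )

# Later, bump replicas only on a hot index while the rest stay cheap.
es.indices.put_settings(
    index="records-007",
    settings={"index": {"number_of_replicas": 2}},
)
```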