Shard Count and Index Splitting Strategies for PB-Level Storage in Elasticsearch

Quick description: I'm dealing with a scenario where we have over 10 billion records, requiring storage at the petabyte level. Following Elasticsearch's official best practices, which recommend keeping each shard between 20-50GB (let's assume 40GB for the calculation), we might end up needing around 10,000 shards. I'm concerned that having so many shards could lead to issues.

We are using the routing parameter, so each request is directed to a specific shard for querying; in theory, query performance should therefore not be affected by the total number of shards. With that in mind, I'm considering two approaches (a sketch of the routing logic follows the list):

  1. Splitting the index into multiple ones, for example, creating 100 indices with 100 shards each. This approach might increase implementation and maintenance costs due to the large number of indices.
  2. Keeping a single index with 10,000 shards, which seems simpler to implement. However, I'm unsure about the potential impact of such a high shard count.
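
For concreteness, here is a rough sketch of the routing logic I have in mind for approach 1, using the Python client (the index naming scheme, the `user-42` key, and the count of 100 indices are all just placeholders):

```python
import zlib

from elasticsearch import Elasticsearch  # elasticsearch-py 8.x client

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

NUM_INDICES = 100  # approach 1: 100 indices x 100 shards each

def target_index(routing_key: str) -> str:
    # crc32 is stable across processes (unlike Python's built-in hash()),
    # so the same key always maps to the same index.
    return f"records-{zlib.crc32(routing_key.encode()) % NUM_INDICES:03d}"

key = "user-42"  # hypothetical routing key

# Both writes and reads pass the same routing value, so each request
# only touches a single shard of a single index.
es.index(index=target_index(key), routing=key, document={"user": key, "value": 1})
es.search(index=target_index(key), routing=key, query={"term": {"user": key}})
```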

Which strategy do you think is more reasonable? I would appreciate any advice or suggestions. Thank you.

Note: Elasticsearch version 8.2.3

Reference: Size your shards

If you are looking at petabyte-scale data it may make sense to split it into a number of large clusters instead of a single very large cluster. There is no defined maximum cluster size as far as I know, but you are likely to hit scalability issues at some point if you try to scale a single cluster out indefinitely. I do not know exactly when this will occur, as it depends on the version of Elasticsearch you are using as well as the specifics of the use case. As you are indexing, updating and querying by routing, you should be able to route to specific clusters in your application logic as well. I would therefore recommend splitting the data into multiple indices to support this. If you want to query all the data you can still do so by setting up cross-cluster search.
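
For example, a cross-cluster search is just a normal search whose index names carry a remote-cluster prefix; a minimal sketch with the Python client (the cluster aliases, seed hosts and index pattern below are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder coordinating cluster

# Register the remote clusters once (aliases "cluster_a"/"cluster_b" are hypothetical).
es.cluster.put_settings(persistent={
    "cluster.remote.cluster_a.seeds": ["es-a.example.com:9300"],
    "cluster.remote.cluster_b.seeds": ["es-b.example.com:9300"],
})

# One request then fans out across both clusters:
es.search(index="cluster_a:records-*,cluster_b:records-*",
          query={"term": {"user": "user-42"}})
```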

Even if you do not want to go down the path of multiple clusters at this point, I would recommend having multiple indices so you do not need complex reindexing if you need to split into multiple clusters in the future.


Not to totally contradict what Christian says (there certainly are limits to how large a single cluster can be), but 10k shards is not a particularly worrying number in modern Elasticsearch.

IIRC there's a limit of something like 1000 or 1024 shards per index. Though I'm not aware of anything bad happening at that level, having more indices might give you more flexibility, e.g. so you can add replicas to some hot subset of your data.
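
To illustrate that flexibility, a quick sketch (the index names and counts here are made up): the primary shard count is fixed when an index is created, but the replica count of any subset of indices can be changed later.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# The number of primary shards is fixed at creation time
# (and capped at roughly 1024 per index):
es.indices.create(index="records-000",
                  settings={"index": {"number_of_shards": 100}})

# Replicas can be adjusted at any time, e.g. only for a hot subset:
es.indices.put_settings(index="records-00*",
                        settings={"index": {"number_of_replicas": 2}})
```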