Hi all. I can find lots of information on shard limitations per node, but I'm struggling to find any guidance on a cluster-wide shard limit. We have an application where, between intelligent index selection and shard routing, we could benefit from having potentially hundreds of thousands of shards in our cluster. We certainly wouldn't be querying them all at once. However, I'm trying to create a test cluster and have hit a severe bottleneck with shard allocation: shard creation takes what seems like exponentially longer as I approach 25k total shards. The bottleneck appears to be a single thread/core on the master node. Are there ways to overcome this? We wouldn't want node recovery or backup restoration to take days in our cluster. Thanks.
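For context, the routing we rely on looks roughly like this (a simplified sketch with the Python client; the index names, field, and routing key are all made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Write each document with a routing key so all of a key's docs land
# on a single shard of that day's index.
es.index(
    index="events-2016.03.01",
    routing="customer-42",
    body={"customer_id": "customer-42", "message": "..."},
)

# At query time we pick only the daily indices in the requested time
# range and pass the same routing key, so only one shard per index
# actually executes the search.
es.search(
    index="events-2016.03.01,events-2016.03.02",
    routing="customer-42",
    body={"query": {"term": {"customer_id": "customer-42"}}},
)
```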
Cluster state changes are single-threaded on the master because that makes it much easier to know they are applied correctly. The idea is that the cluster state shouldn't change very often, so parallelizing it isn't worth the complexity. There have been lots of improvements though (and you can watch the master's queue directly; see the sketch after this list):
- In 1.x, after each change the whole cluster state had to be sent to every node. In 2.0+ Elasticsearch sends diffs. We used to saturate the NIC on the master of large clusters...
- Later in 2.x, Elasticsearch started batching cluster state changes so it can make a pile of changes in one go and then sync the result out.
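If you want to see that single thread backing up, the pending-tasks API exposes its queue. A minimal sketch with the Python client, assuming a cluster reachable on localhost:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Tasks waiting on the master's single cluster-state thread. When
# shard creation backs up, this queue is where it shows.
for task in es.cluster.pending_tasks()["tasks"]:
    print(task["priority"], task["time_in_queue"], task["source"])

# The queue length also shows up in cluster health.
print(es.cluster.health()["number_of_pending_tasks"])
```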
The reason we don't spend even more time on this is that having tons of tiny shards doesn't tend to buy much performance. Elasticsearch is fairly good at picking individual results out of larger indexes: queries handle this very well, fetching the actual values is very fast for aggregations, and it's fast enough when returning the _source of dozens or hundreds of documents at a time.
Each shard also has some non-trivial memory overhead. It is worth experimenting with, but as things stand you really should avoid setups like an index per customer, or an index per hour with six months of retention. Even reasonable-sounding schemes like an index per language don't work well if you do the combinatorial-explosion thing and end up with an index per language per document type.
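To put rough numbers on that overhead yourself, something like this shows the cluster-wide shard count and the heap pinned by segments on each node (again a sketch with the Python client; the exact stats fields have moved around a bit between versions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Total shard count across the cluster.
shards = es.cat.shards(format="json")
print("total shards:", len(shards))

# Heap held by open segments on each node; every shard contributes
# some fixed overhead here no matter how small it is.
for node in es.nodes.stats(metric="indices")["nodes"].values():
    mem = node["indices"]["segments"]["memory_in_bytes"]
    print(node["name"], "segment memory (bytes):", mem)
```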
So, yeah, you've found a number of indices and/or shards that is too large.
Thanks for the quick reply and helpful insight. I thought I would share a little more about how we came to this approach. We are targeting 200ms response times per shard, and we found we could achieve that for our particular queries/aggs with 300MB shards. We have over 2B docs in daily indices going back several years, and choosing a shard count that targets 300MB per shard yields tens of thousands of shards. Routing keys ensure we aren't hitting every shard of an index, and our queries typically cover much less than a year of data.
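For what it's worth, here's the back-of-the-envelope math that gets us to those numbers (the 300MB target and daily indices are real; the average document size below is just an assumption to make the arithmetic concrete):

```python
# Rough shard-count math. The 300MB target and daily indices are
# real; the average document size is an illustrative guess.
years = 3                       # "the last several years"
daily_indices = years * 365     # one index per day
total_docs = 2_000_000_000      # 2B+ documents
avg_doc_bytes = 4 * 1024        # assumed ~4KB per indexed doc

total_bytes = total_docs * avg_doc_bytes
shard_bytes = 300 * 1024 ** 2   # 300MB target shard size
total_shards = total_bytes / shard_bytes

print(f"~{total_shards:,.0f} primary shards, "
      f"~{total_shards / daily_indices:.0f} per daily index")
```

With those assumptions it works out to roughly 26k primary shards, about 24 per daily index, which lines up with where we hit the allocation wall.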