Standard rule of thumb to determine when to go from 1 shard to 2 and so on?

What is a good rule of thumb for determining when to shard an Elasticsearch index? I can't seem to find a definitive answer on how to decide this. For an ecommerce-driven database, is there a standard formula for determining when to do this? Most of our data is ecommerce data like orders, products, and customers, with the majority of I/O being reads.

Over time we'll have thousands of user-based indices, which will most likely all start with 1 shard and have <20K documents initially.

That isn't likely to work in a single cluster. An index per user is fine if you have a dozen huge users, but small users probably need to live in an index with other small users. The overhead of indexes just doesn't scale like that.

Wow! nik9000, thanks for the response. I was just reading about your work on the reindexing abilities in 2.3.

So I guess we'll take the "Faking Index per User with Aliases" route for small users, keeping them in a shared index with overallocated shards, and we'll need to figure out when to split One Big User out into their own index.
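
For anyone following along, that pattern boils down to a filtered alias with routing. Here's a minimal sketch using the Python client, where the shared_users_v1 index name and the user_id field are placeholders I made up:

```python
# A minimal sketch of "faking an index per user", assuming the Python
# elasticsearch client; index and field names here are invented.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

user_id = "12345"

# One alias per user: it filters to that user's documents and routes all
# of their traffic to a single shard of the shared index.
es.indices.put_alias(
    index="shared_users_v1",
    name="user_" + user_id,
    body={
        "filter": {"term": {"user_id": user_id}},
        "routing": user_id,
    },
)

# The application then treats the alias like a dedicated index.
results = es.search(index="user_" + user_id, body={"query": {"match_all": {}}})
```

Because the alias carries both the filter and the routing value, a search through it touches only one shard and returns only that user's documents.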

From this article, it sounds like if I have 4 nodes, then my "fake index per user with aliases" shared index should initially have 4 shards? Makes sense, but seems too simple. This article on overallocation suggests, if I'm reading it right, that with 4 nodes I might want to create my shared user index with 6 shards. Then when I add more nodes, because I overallocated, no reindex is required and the new nodes will simply start being used.
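
If I'm understanding it right, the overallocation part is just a one-time choice at index creation. A sketch, with my own placeholder names and replica count:

```python
# Creating the shared index overallocated to 6 primary shards on a 4-node
# cluster; a sketch, not a recommendation.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

es.indices.create(index="shared_users_v1", body={
    "settings": {
        "number_of_shards": 6,    # fixed at creation time, so overallocate now
        "number_of_replicas": 1,  # can be changed at any time
    },
})
```

The number of primary shards can't be changed after creation, which is exactly why the overallocation has to happen up front.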

I think I'm probably looking for too simple an answer for my "One Big User" scenario, and I need to spend more time in the Designing for Scale docs.

However, if someone has a good rule of thumb, in one paragraph or less, for determining scale for the "One Big User" case or my shared index, I'd be interested to hear thoughts.

In addition to the information already provided: in theory, each shard can hold up to 2B documents (assuming there are no sub-documents per document), so depending on your data flow you can estimate how many shards you'll need per index. Even though each shard can hold 2B documents, I would not suggest filling one up; you'll need to run performance tests based on your data and hardware to determine a reasonable configuration.
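
For example, a back-of-the-envelope estimate (every number here except the Lucene cap is invented):

```python
# Rough shard-count estimate from Lucene's hard per-shard document limit.
LUCENE_DOC_LIMIT = 2147483519  # Integer.MAX_VALUE - 128; nested docs count too

expected_docs = 500000000      # e.g. 500M projected orders/products/customers
target_fill = 0.10             # stay well below the cap; tune by load testing

min_shards = expected_docs / (LUCENE_DOC_LIMIT * target_fill)
print(min_shards)  # ~2.3 -> at least 3 shards, plus overallocation headroom
```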

Here's my first phase plan then based on the feedback. I'm just looking for a thumbs up that I'm heading in the right direction. I also recognize the best answer is here in the Capacity Planning section.

Problem: We are going to have thousands of users whose data, for all intents and purposes, is not to be "shared" with any other user accessing our app interface. Most users will have very little data, though they might grow over time, and some users might have a lot of data from the outset.

First phase Solution:

In a 4 node cluster:

  • Overallocate our shared index with 6 shards initially. For fast, unexpected growth we can add 2 nodes and not have to reindex.

  • Most users will have small amounts of data, which would benefit from the "Faking Index per User with Aliases" route.

    • Faking an index per user would give small users the advantage of routing, so they don't suffer from being in a 4-node, 6-shard cluster.

    • However, could having too many aliases lead to overhead that negates the benefits of routing? Any thoughts on this?

    • Using aliases would also allow us to easily split a formerly small user, which is now a big user, out into a new index that takes advantage of our 6 shards, with zero downtime (see the sketch after this list).

  • We'll monitor each shard, and when shards reach around 5GB we'll bring up 2 more nodes, which will be used automatically because we started with 6 shards.
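
To make the last two bullets concrete, here's a rough sketch of monitoring the shared index's primary shard sizes and of splitting a big user out via _reindex plus an atomic alias swap. It assumes the Python client and the placeholder names from the earlier sketches; the _reindex API needs ES 2.3+:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Monitor primary shard sizes in the shared index; ~5GB is our trigger.
stats = es.indices.stats(index="shared_users_v1", level="shards")
for copies in stats["indices"]["shared_users_v1"]["shards"].values():
    for shard in copies:
        if shard["routing"]["primary"]:
            print(shard["store"]["size_in_bytes"] / 2 ** 30, "GB")

# Split a big user out into a dedicated index with zero downtime.
big_user = "12345"
new_index = "user_%s_v1" % big_user

es.indices.create(index=new_index, body={
    "settings": {"number_of_shards": 6, "number_of_replicas": 1},
})

# Copy only that user's documents with the _reindex API.
es.reindex(body={
    "source": {"index": "shared_users_v1",
               "query": {"term": {"user_id": big_user}}},
    "dest": {"index": new_index},
})

# Atomically repoint the per-user alias; readers never see a gap. The
# user's old documents can be deleted from the shared index afterwards.
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "shared_users_v1", "alias": "user_" + big_user}},
    {"add": {"index": new_index, "alias": "user_" + big_user}},
]})
```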

We've tested the use of aliases in our test environment, and I would say that with a reasonable number of aliases (or users, in this case) it works really well. If your users grow from 1K to 2K, for example, and you can't add more hardware, performance would suffer tremendously. So if you are aiming for 1K users, you should test your setup with at least 2K users to see whether the performance is acceptable.

You mentioned adding 2 additional servers a lot. If you think you can run a 6-node cluster now, then go for it; don't wait.

I don't know enough about your data to provide anything more specific here, but make sure the physical disk(s) are large enough to hold the index and the replicas.
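
As a trivial worked example of that check, with an assumed safety factor:

```python
# Disk needed = primaries + replicas, plus headroom for segment merges.
primary_gb = 6 * 5   # 6 shards at the ~5GB trigger point mentioned above
replicas = 1
headroom = 1.5       # merges can temporarily need substantial extra space

required_gb = primary_gb * (1 + replicas) * headroom
print(required_gb)   # 90.0 GB across the cluster at that point
```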