Design practice to prevent oversharding

Hi there,

I'm struggling with an application design and the impact it has on the cluster.

Let me explain the use case:

My customers can create databases inside my application to store and retrieve data.

Every time a customer creates a database, I create a new index with a mapping corresponding to what the customer defined (the user has a simple WYSIWYG interface and has no idea about the mapping; it is generated on our side).
These are quite "simple" databases (columns, values, ...), but each customer wants their own structure.
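
For illustration, a minimal sketch of what that per-database approach looks like with the Python client (the index name, fields and connection details are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details

# Hypothetical mapping generated from what the customer designed in the UI
generated_mapping = {
    "properties": {
        "order_id": {"type": "keyword"},
        "amount": {"type": "double"},
        "created": {"type": "date"},
    }
}

# One new index per customer "database" -- this is what makes the shard count grow
es.indices.create(
    index="customer-42-db-orders",  # hypothetical naming scheme
    mappings=generated_mapping,
    settings={"number_of_shards": 1, "number_of_replicas": 2},
)
```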

The issue with this simplistic approach is that every time a customer creates a "database" in our product, a new index is created, so the shard count keeps adding up and this becomes very costly.

So I think doing this is bad practice, but I can't figure out what the alternatives are.

I found out about filtered aliases, but since the mappings differ (customers don't share the same data structure), I don't think putting everything inside one big index will work.
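
For reference, a filtered alias would look roughly like this (just a sketch with made-up names; the catch is that every customer's fields would then have to coexist in the mapping of that single big index):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details

# One big shared index, one filtered alias per customer database;
# "db_id" is a hypothetical discriminator field present on every document.
es.indices.put_alias(
    index="all-customer-data",
    name="customer-42-db-orders",
    filter={"term": {"db_id": "customer-42-orders"}},
)

# Queries against the alias only see that database's documents
es.search(index="customer-42-db-orders", query={"match_all": {}})
```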

I'm out of ideas; the current implementation works fine, but the number of shards will get out of control very quickly if we keep going this way.

Also, to give an indicator: customer-managed databases vary in size. We expect some to reach hundreds of millions of documents at most, while others will hardly reach 100K documents, but we can't predict which.

Thanks for your help

You will know this before you create the mapping and index, right?

If yes, you can design your sharding strategy accordingly.

For most of your indices one shard will be enough; only for the large ones do you need multiple shards.
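
As a sketch (hypothetical names, Python client), the shard count is just a setting chosen at index creation time:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details

# Small/medium databases: a single primary shard is usually enough
es.indices.create(
    index="customer-7-db-notes",
    settings={"number_of_shards": 1, "number_of_replicas": 1},
)

# Only the indices you expect to grow large get more primaries
es.indices.create(
    index="customer-7-db-events",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
)
```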

How many indices and shards do you have as of now, and how much growth do you expect on a monthly basis?

Hi @DineshNaik ,

Sadly, I can't.
A customer registers, creates databases and starts using the product (it's a SaaS product); they don't get in touch with us while using it, and the generated mapping depends on what the customer designs (how many columns, what types of data, etc.).

I have already set the primary shard count to 1 for our application-generated indices, but the developers argue that this will likely not be enough if an index grows over time and reaches millions of documents, so they are asking for 2 shards per database.

Currently I have no idea how many simultaneous queries (writes/reads) one shard can be expected to handle on a medium-sized AWS instance like i3.large/i3.xlarge (so what counts as a "large index"?).

But even with 1 primary shard per customer database, plus 2 replicas to ensure minimum resiliency, that's 3 shards per customer database.
If we gain a mere 10 customers/month, for example, each with an average of 5 databases, by the end of the year this results in 1,800 shards (primaries + replicas); that's a big infrastructure cost.
And the fun part is that we will likely have customers who create their databases, play around a little and leave them with a few documents inside, consuming precious system resources for nothing.
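
To spell out the arithmetic (using the numbers above: 10 new customers/month, 5 databases each, 1 primary + 2 replicas):

```python
new_customers_per_month = 10
databases_per_customer = 5
copies_per_database = 1 + 2  # 1 primary + 2 replicas

customers_after_a_year = 12 * new_customers_per_month  # 120
shards_after_a_year = (customers_after_a_year
                       * databases_per_customer
                       * copies_per_database)
print(shards_after_a_year)  # 1800
```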

That's why I was wondering whether we are making a design mistake in our Elasticsearch usage.

Unless these are huge indices, you can run that on 3 nodes.

Alternatively, you can use just 1 replica.

Since each shard costs system resources, and the official recommendation is about 20 shards per 1 GB of heap memory, that would mean 3 very big instances :confused:
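
Spelling that estimate out (using the ~20 shards per GB of heap rule of thumb and the 1,800 shards from above):

```python
total_shards = 1800
shards_per_gb_heap = 20
total_heap_gb = total_shards / shards_per_gb_heap  # 90 GB of heap across the cluster
heap_per_node_gb = total_heap_gb / 3               # ~30 GB of heap per node on 3 nodes
print(total_heap_gb, heap_per_node_gb)             # 90.0 30.0
```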

But I digress; the main concern is the design. Creating one index for each customer DB seems overkill, and I was wondering if I'm missing something that would allow us to achieve the same thing using far fewer resources.

If that's not the case then you're right, we have to keep up with the growing shard count.

You can always take a monitoring and cleanup approach: if an index has had no reads/writes for, say, 7 days (the threshold can differ based on your use case), you can delete it altogether so that your resources are billed for actual usage!
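
A rough sketch of such a cleanup job (hypothetical names and thresholds; it compares the indexing/search counters from the stats API with values saved on the previous run, since there is no per-index "last used" timestamp):

```python
import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details
STATE_FILE = "index_activity.json"           # hypothetical file with last run's counters


def idle_indices(previous_counters):
    """Return indices whose indexing/search counters have not moved since the last run."""
    stats = es.indices.stats(index="customer-*", metric=["indexing", "search"])
    idle, current = [], {}
    for name, data in stats["indices"].items():
        writes = data["total"]["indexing"]["index_total"]
        reads = data["total"]["search"]["query_total"]
        current[name] = [writes, reads]
        if previous_counters.get(name) == [writes, reads]:
            idle.append(name)
    return idle, current


# with open(STATE_FILE) as f:
#     previous = json.load(f)
# idle, current = idle_indices(previous)
# for name in idle:
#     es.indices.delete(index=name)  # only if deleting stale data is acceptable
# with open(STATE_FILE, "w") as f:
#     json.dump(current, f)
```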

That's a good idea, but it's not applicable in my case: the customer data lives as long as the customer exists (and thus can be used at any time), so we can't delete their data after X days.

So, sticking to my initial question: is one index per customer database the only choice we have because of the mappings?

Do you have an idea of how many fields each index will have? This is what will impact your mappings.

Also, do the databases created by one customer normally share some fields?

Instead of one index per database per customer, you could try one index per customer and use a custom field to filter the databases; this will probably require some changes in your application logic.
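
A minimal sketch of that layout (index and field names are hypothetical): each document carries the database it belongs to, and every query filters on it:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details

# One index per customer; "db_id" (hypothetical) identifies the customer-created database
es.index(
    index="customer-42",
    document={"db_id": "orders", "order_id": "A-1001", "amount": 99.5},
)

# Reads always filter on the database discriminator
es.search(
    index="customer-42",
    query={
        "bool": {
            "filter": [{"term": {"db_id": "orders"}}],
            "must": [{"range": {"amount": {"gte": 50}}}],
        }
    },
)
```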

The size of the documents is more important than the number of documents: you can have a small index with 100 million documents and a big index with just a couple of thousand, so you need to focus on the size of the index instead of the document count.

You could use an ILM policy to roll your index over when it reaches a specific size; this also helps you avoid oversharding. The ideal size for a shard is somewhere near ~50 GB.
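
A sketch of such a policy with the Python client (names are made up; on older versions the rollover condition would be `max_size` instead of `max_primary_shard_size`):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details

# ILM policy: roll over to a new backing index once a primary shard reaches ~50 GB
es.ilm.put_lifecycle(
    name="customer-data-policy",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb"}
                }
            }
        }
    },
)

# Attach the policy to new indices through an index template
es.indices.put_index_template(
    name="customer-data-template",
    index_patterns=["customer-*"],
    template={
        "settings": {
            "index.lifecycle.name": "customer-data-policy",
            "index.lifecycle.rollover_alias": "customer-data-write",  # hypothetical write alias
        }
    },
)
```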

Yes, we limit the number of fields to 30, and on average we expect about 15 fields per database (the system is not in production yet, so these are all assumptions).

That's a good question; I would say "it depends", as each customer has its own needs. But considering that we could use one index per customer, with at most 30 fields per database and 30 databases per customer, that would result in a mapping of 900 fields; isn't that too much?
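
For what it's worth, the default cap on mapped fields is 1,000 per index (`index.mapping.total_fields.limit`), so 900 would sit just under it; the limit can be raised per index if needed, roughly like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details

# Raise the mapped-field limit for a hypothetical customer index (default is 1000)
es.indices.put_settings(
    index="customer-42",
    settings={"index.mapping.total_fields.limit": 2000},
)
```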

For the size, I would say something like 1-10 KB per document; some customers may put in big JSON documents resulting in 50 KB documents, but that will be rare.

I'm asking all these questions because in the past we had a bad experience with a small cluster that had 10K documents and something like 1K shards; it resulted in occasional crashes and hiccups. Lesson learned :slight_smile:
