Index design in a multitenant application

I need some help validating my index design. We are in the planning stages of moving from version 5.6 to 8.x.

Background:
We have a Postgres database that is the source of truth, and it is split into several clusters (appDb1, appDb2, etc.). When the app was first built, the indices in Elasticsearch were split up to match exactly, so we have an app1 index with about 7 different types, and the same again for each of the other indices.

We are now looking at upgrading from 5.6 to the latest version and with this we'll need to rethink how we've designed the indices since multiple types per index are no longer a thing.

Our planned design:
This is where I need some help validating whether I'm thinking about this correctly or heading in completely the wrong direction.

The new idea is to have a single index per type, so we would start with 7 different indices. This might, however, lead to the indices having quite a lot of shards, so I guess custom routing would be a pretty good fit here(?). In that case we would use the application ID as the custom routing value.
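A rough sketch of what a search routed by application ID could look like, assuming a hypothetical index named `orders` and a keyword field `appId` stored in every document (neither name comes from the setup described here). Routing only narrows the search to the shard that the application hashes to; other applications share that shard, so a filter on `appId` is still needed.

```python
import json


def routed_search(index, app_id, query):
    """Build the path, query-string parameters and body for a routed search.

    `index` and the `appId` field name are hypothetical placeholders.
    """
    body = {
        "query": {
            "bool": {
                "must": [query],
                # Routing selects the shard; this filter keeps other
                # applications on the same shard out of the results.
                "filter": [{"term": {"appId": app_id}}],
            }
        }
    }
    return {
        "path": f"/{index}/_search",
        "params": {"routing": str(app_id)},
        "body": body,
    }


req = routed_search("orders", 42, {"match": {"title": "invoice"}})
print(req["path"], req["params"])
print(json.dumps(req["body"], indent=2))
```

The returned pieces map onto `GET /orders/_search?routing=42` with the JSON body as the request payload. The same routing value has to be supplied when the document is indexed, otherwise the search may look at the wrong shard.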

There is also a risk here that ids on individual objects aren't unique, but combining appId+objectId as a string should make it unique.
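As an illustration of the combined ID, here is a minimal sketch that builds the two NDJSON lines for one document in a `_bulk` request. The index name `orders`, the `appId` field and the `:` separator are assumptions made for the example. A separator is worth including: with plain concatenation, appId 1 + objectId 23 and appId 12 + objectId 3 would both produce "123". The sketch assumes neither ID can itself contain the separator.

```python
import json


def bulk_index_lines(index, app_id, object_id, source):
    """Return the action line and the document line for one bulk item.

    The index name and the `appId` field are hypothetical placeholders.
    """
    action = {
        "index": {
            "_index": index,
            # appId + objectId joined with a separator, so that
            # (1, 23) and (12, 3) cannot collide.
            "_id": f"{app_id}:{object_id}",
            # Route by application so all of its documents share a shard.
            "routing": str(app_id),
        }
    }
    # Keep appId in the document as well, so searches can filter on it.
    doc = {"appId": app_id, **source}
    return [json.dumps(action), json.dumps(doc)]


lines = bulk_index_lines("orders", 12, 3, {"title": "invoice"})
print("\n".join(lines))
```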

I guess this is the initial thought we have. Have I missed something? Or am I maybe going in completely the wrong direction with this solution?

Grouping data of the same type and mappings together is generally considered best practice, so this sounds reasonable.

Whether this is a good idea or not will depend on the data.

How many applications do you have? How is this expected to grow over time?

What does the size distribution of applications look like? Are they all reasonably equal in volume? Does this depend on the type of data? If the differences in size are large, do you know in advance which applications will be large and which will be small?

What is the current data volume in your cluster? How many primary and replica shards is that distributed across?


Thanks for the response Christian

How many applications do you have? How is this expected to grow over time?

Currently we have around 1,200 applications and are growing at a pace of 20% per year. We are also planning to expand into new markets, so this could increase in the future.

What does the size distribution of applications look like? Are they all reasonably equal in volume?

Customers that have been with us for a long time tend to grow in size, but there are no extreme size differences. Currently the largest index is about 16 GB, shared by about 380 applications. We generally don't know in advance which applications will be larger or smaller.

What is the current data volume in your cluster? How many primary and replica shards is that distributed across?

Currently we have about 141 GB and a total of 120,000,000 documents.
The number of documents per type varies quite a bit, from 13,000 for the smallest type to 29,000,000 for the largest.

Today each index has 5 primaries with 1 replica. This is in no way optimized, since the indices also vary a lot in size.

The fact that you have a reasonably large number of applications, and that their sizes are reasonably evenly distributed, means routing might be a good option.

5 primary shards for small indices sounds reasonable. For somewhat larger ones I would consider 10 primary shards. If you have any indices/types that are much larger than the others you may want to consider 20 primary shards.

Aim for a shard size of around 5GB to 10GB. This should allow you to grow for a number of years without having to worry too much about shard size.
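To make the sizing guideline concrete, here is a small back-of-the-envelope sketch. It assumes data volume grows at the same yearly rate as the number of applications (20% in this thread), which may not hold in practice, and the 40 GB starting size for a single per-type index is a made-up figure for illustration only.

```python
import math


def primary_shards(current_gb, yearly_growth, years, target_shard_gb):
    """Estimate how many primary shards keep shards near the target size.

    Projects the index size with compound growth, then divides by the
    desired shard size and rounds up.
    """
    projected_gb = current_gb * (1 + yearly_growth) ** years
    return max(1, math.ceil(projected_gb / target_shard_gb))


# Hypothetical 40 GB index, 20% growth per year, 5-year horizon,
# aiming for the upper end of the 5GB to 10GB range.
print(primary_shards(40, 0.20, 5, 10))
```

Running the same function with the lower 5GB target gives the shard count for the other end of the range, which is a quick way to see how sensitive the result is to the chosen target.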
