Scaling *always-hot* indices automatically? Multiple shards or indices?

I have a SaaS system where each client has their own separate index. There are about 400 indices on 8 nodes with a total of 5TB of data.

The data is not a time series but is always 'hot': there are constant inserts, updates and deletes across the entire data set. 100% availability of each index is critical.

Each index has a different, complex mapping of around 100-500 fields, including a handful of nested fields.

I am running into issues scaling this for our larger clients. By default we create indices with 2 shards and stick to the recommended maximum of 20GB per shard. This means that for the biggest 10% of our indices we need to manually set a larger number of shards at creation time. Since it is hard to predict at creation time how big an index will grow, for existing indices this means we need to use the Split API, which introduces downtime for that index and involves a lot of manual management.
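For reference, a minimal sketch of what that manual split involves today, using the Python client (the endpoint, index names and shard counts are all made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # endpoint is an assumption

# The source index must be write-blocked before a split,
# which is where the downtime for that index comes from.
es.indices.put_settings(index="client-42", body={"index.blocks.write": True})

# The target's shard count must be a multiple of the source's.
es.indices.split(
    index="client-42",
    target="client-42-split",
    body={"settings": {"index.number_of_shards": 4}},
)
```

On top of this, the old index still has to be swapped out and deleted afterwards, which is most of the manual management mentioned above.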

I am looking for a way to scale each index automatically, with minimal manual involvement. For that I was thinking of creating each index with a single shard and creating a new index once it grows past a predetermined size (e.g. 20GB), much like ILM and the Rollover API provide.
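A minimal sketch of that idea, reusing the `es` client from above; the alias and index names are made up, and the `max_primary_shard_size` condition assumes a reasonably recent ES version (`max_size` is the older equivalent):

```python
# First backing index for this client carries the write alias.
es.indices.create(
    index="client-42-000001",
    body={
        "settings": {"index.number_of_shards": 1},
        "aliases": {"client-42-write": {"is_write_index": True}},
    },
)

# Run periodically (or let ILM do it): roll over once the
# single shard passes the size threshold.
resp = es.indices.rollover(
    alias="client-42-write",
    body={"conditions": {"max_primary_shard_size": "20gb"}},
)
if resp["rolled_over"]:
    print("rolled over to", resp["new_index"])
```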

However, there are a few caveats that prevent me from implementing it this way:

  • All documents remain 'hot' and can be updated or deleted at any time, so after a rollover there is no single write index, except for brand-new documents.
  • Mappings are unique per index and determined by the application. Having to manage these as index templates in ES would be cumbersome (one possible workaround is sketched right after this list).
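For that second caveat, the Rollover API (like the create-index API) accepts mappings inline in its body, so the application could pass its per-client mapping at rollover time instead of maintaining templates in the cluster. A sketch, with `client_mapping` standing in for the application-held mapping:

```python
# Hypothetical mapping held by the application rather than a template.
client_mapping = {"properties": {"name": {"type": "keyword"}}}

es.indices.rollover(
    alias="client-42-write",
    body={
        "conditions": {"max_primary_shard_size": "20gb"},
        "mappings": client_mapping,  # applied to the new backing index
    },
)
```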

Are there ways around these constraints? Especially the first one seems difficult: either the application needs to know which specific index to send the request to, or the update/delete needs to be broadcast to all indices and only processed where the document is found (i.e. avoiding upserts). Could GUIDs help out here? Right now we use arbitrary IDs provided by external input.
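One hedged sketch of the 'only processed if found' variant, assuming the backing indices share a read alias (all names hypothetical): resolve which concrete index holds the document, then target the update at it.

```python
def update_if_exists(es, alias, doc_id, partial_doc):
    """Find the backing index that holds doc_id, then update it there."""
    hits = es.search(
        index=alias,
        body={"query": {"ids": {"values": [doc_id]}}},
        size=1,
        _source=False,
    )["hits"]["hits"]
    if not hits:
        return False  # absent everywhere: skip instead of upserting
    es.update(index=hits[0]["_index"], id=doc_id, body={"doc": partial_doc})
    return True
```

This costs one extra search per update; an ID scheme that encodes the target index (which may be where GUIDs could come in) would avoid the lookup entirely.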

What about pressure on the cluster state? Is there a difference between 4 indices with identical mappings and 1 shard each, and 1 index with 4 shards? As far as I understand, for querying there should be little difference between the two scenarios.
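On the query side, a sketch of why the two layouts look similar to the application (alias and index names are made up): behind one read alias, a search fans out to the same four shards either way.

```python
# Group the per-client backing indices behind one read alias.
es.indices.update_aliases(body={
    "actions": [
        {"add": {"index": "client-42-000001", "alias": "client-42"}},
        {"add": {"index": "client-42-000002", "alias": "client-42"}},
    ]
})

# Hits N single-shard indices much like one N-shard index would.
results = es.search(index="client-42", body={"query": {"match_all": {}}})
```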

As an upper limit, I think it is safe to say that no client will need more than 32 shards/indices in this scenario.

Would this be a good approach? Are there any viable alternatives? Are there any ways around the proposed caveats?

I cannot think of any easy way to get around the caveats you described. I have seen a similar scheme before, but it relied on looking up the index name externally for each update and insert.
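A minimal sketch of such an external-lookup scheme, with a plain dict standing in for whatever store actually held the ID-to-index mapping (all names hypothetical, reusing the Python client from above):

```python
index_of: dict[str, str] = {}  # doc_id -> concrete backing index

def insert(es, write_alias, doc_id, doc):
    resp = es.index(index=write_alias, id=doc_id, body=doc)
    index_of[doc_id] = resp["_index"]  # remember where the doc landed

def update(es, doc_id, partial_doc):
    # No broadcast needed: the lookup gives the exact index.
    es.update(index=index_of[doc_id], id=doc_id, body={"doc": partial_doc})
```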

Thanks! It is indeed possible to work around them. But can you think of any conceptual problems with this idea? Especially the cluster-state question above, considering my mappings are pretty big.
