Scaling *always-hot* indices automatically? Multiple shards or indices?

I have a SaaS system where each client has their own separate index. There are about 400 indices on 8 nodes with a total of 5TB of data.

The data is not a time series but is always 'hot': there are constant inserts, updates and deletes across the entire data set. 100% availability of each index is critical.

Each index has a different, complex mapping of around 100-500 fields, including a handful of nested fields.

I am running into issues scaling this for our larger clients. By default we create indices with 2 shards and stick to the recommended maximum of 20GB per shard. This means that for the biggest 10% of our indices we need to manually set a larger number of shards at creation time. Since it is hard to predict at creation time how big an index will grow, for existing indices this means we need to use the Split API, which introduces downtime for that index and involves a lot of manual management.
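For reference, a minimal sketch of what that manual split involves today, using the Python client (the endpoint, index names and shard counts are all made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # endpoint is an assumption

# The source index must be write-blocked before a split,
# which is where the downtime for that index comes from.
es.indices.put_settings(index="client-42", body={"index.blocks.write": True})

# The target's shard count must be a multiple of the source's.
es.indices.split(
    index="client-42",
    target="client-42-split",
    body={"settings": {"index.number_of_shards": 4}},
)
```

On top of this, the old index still has to be swapped out and deleted afterwards, which is most of the manual management mentioned above.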

I am looking for a way to scale each index automatically, with minimal manual involvement. For that I was thinking of creating each index with a single shard and creating a new index once it grows past a predetermined size (e.g. 20GB), much like ILM and the Rollover API provide.
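A minimal sketch of that idea, reusing the `es` client from above; the alias and index names are made up, and the `max_primary_shard_size` condition assumes a reasonably recent ES version (`max_size` is the older equivalent):

```python
# First backing index for this client carries the write alias.
es.indices.create(
    index="client-42-000001",
    body={
        "settings": {"index.number_of_shards": 1},
        "aliases": {"client-42-write": {"is_write_index": True}},
    },
)

# Run periodically (or let ILM do it): roll over once the
# single shard passes the size threshold.
resp = es.indices.rollover(
    alias="client-42-write",
    body={"conditions": {"max_primary_shard_size": "20gb"}},
)
if resp["rolled_over"]:
    print("rolled over to", resp["new_index"])
```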

However, there are a few caveats that prevent me from implementing it this way:

  • All documents remain 'hot' and can be updated or deleted at any time, so after a rollover there is no single write index, except for brand-new documents.
  • Mappings are unique per index and determined by the application. Having to manage these as index templates in ES would be cumbersome (one possible workaround is sketched right after this list).
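For that second caveat, the Rollover API (like the create-index API) accepts mappings inline in its body, so the application could pass its per-client mapping at rollover time instead of maintaining templates in the cluster. A sketch, with `client_mapping` standing in for the application-held mapping:

```python
# Hypothetical mapping held by the application rather than a template.
client_mapping = {"properties": {"name": {"type": "keyword"}}}

es.indices.rollover(
    alias="client-42-write",
    body={
        "conditions": {"max_primary_shard_size": "20gb"},
        "mappings": client_mapping,  # applied to the new backing index
    },
)
```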

Are there ways around these constraints? Especially the first one seems difficult: either the application needs to know which specific index to send the request to, or the update/delete needs to be broadcast to all indices and only processed where the document is found (i.e. avoiding upserts). Could GUIDs help out here? Right now we use arbitrary IDs provided by external input.
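One hedged sketch of the 'only processed if found' variant, assuming the backing indices share a read alias (all names hypothetical): resolve which concrete index holds the document, then target the update at it.

```python
def update_if_exists(es, alias, doc_id, partial_doc):
    """Find the backing index that holds doc_id, then update it there."""
    hits = es.search(
        index=alias,
        body={"query": {"ids": {"values": [doc_id]}}},
        size=1,
        _source=False,
    )["hits"]["hits"]
    if not hits:
        return False  # absent everywhere: skip instead of upserting
    es.update(index=hits[0]["_index"], id=doc_id, body={"doc": partial_doc})
    return True
```

This costs one extra search per update; an ID scheme that encodes the target index (which may be where GUIDs could come in) would avoid the lookup entirely.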

What about pressure on the cluster state? Is there a difference between 4 indices with identical mappings and 1 shard each, and 1 index with 4 shards? As far as I understand, for querying there should be little difference between the two scenarios.
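On the query side, a sketch of why the two layouts look similar to the application (alias and index names are made up): behind one read alias, a search fans out to the same four shards either way.

```python
# Group the per-client backing indices behind one read alias.
es.indices.update_aliases(body={
    "actions": [
        {"add": {"index": "client-42-000001", "alias": "client-42"}},
        {"add": {"index": "client-42-000002", "alias": "client-42"}},
    ]
})

# Hits N single-shard indices much like one N-shard index would.
results = es.search(index="client-42", body={"query": {"match_all": {}}})
```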

As an upper limit, I think it is safe to say that no client will need more than 32 shards/indices in this scenario.

Would this be a good approach? Are there any viable alternatives? Are there any ways around the proposed caveats?

I cannot think of any easy way to get around the caveats you described. I have seen a similar scheme before, but it relied on looking up the index name externally for each update and insert.
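A minimal sketch of such an external-lookup scheme, with a plain dict standing in for whatever store actually held the ID-to-index mapping (all names hypothetical, reusing the Python client from above):

```python
index_of: dict[str, str] = {}  # doc_id -> concrete backing index

def insert(es, write_alias, doc_id, doc):
    resp = es.index(index=write_alias, id=doc_id, body=doc)
    index_of[doc_id] = resp["_index"]  # remember where the doc landed

def update(es, doc_id, partial_doc):
    # No broadcast needed: the lookup gives the exact index.
    es.update(index=index_of[doc_id], id=doc_id, body={"doc": partial_doc})
```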

Thanks! It is indeed possible to work around them. But can you think of any conceptual problems with this idea? Especially the cluster-state question above, considering my mappings are pretty big.
