Shard distribution

In our cluster we have a problem where the nature of the data means that some indices will have 80 primary shards of roughly 10 GB each, while other indices will have a single primary shard that is only 500 MB.

My experience with Elasticsearch has been that this can lead to the cluster making some poor balancing decisions, as it seems to want to treat all shards as equal. Recently a new node joined our cluster and ended up being given a disproportionate number of the large shards. The end result was that this node had just over half as many shards as every other node in the cluster (which all vary by only a single shard) but over twice the storage usage (enough to push it past our low watermark).
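
For anyone wanting to reproduce the picture, the skew only shows up when you compare disk usage rather than shard counts; the standard cat APIs make it obvious (nothing here is specific to our setup):

```
GET _cat/allocation?v
GET _cat/shards?v&h=index,shard,prirep,store,node&s=store:desc
```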

There are some things we do to combat this (we are already set up with forced awareness, and we do not have problems with data loss). We've tweaked the index balance weights to favor an even distribution of indices over shards. And all of our indices are set up to allow no more shards per node than is necessary for our cluster.
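
As a rough sketch of what I mean (the weight values and index name here are purely illustrative, not what we actually run):

```
# Favor spreading each index's shards evenly over balancing raw shard counts
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.index": 0.70,
    "cluster.routing.allocation.balance.shard": 0.30
  }
}

# Cap how many shards of a single index can land on one node
PUT my-big-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 8
}
```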

But neither of these solutions works well when new indices are created on an hourly or daily basis: even if every index is only allowed a single shard on each node, an hourly index series running over several days means a single node could still end up with dozens of shards from what is essentially the same index.
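
To make that concrete, suppose each hourly index picks up the cap from a template (the index names are made up, and I'm using the legacy template API here for brevity):

```
PUT _template/hourly-logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.routing.allocation.total_shards_per_node": 1
  }
}
```

The limit applies per index, so after a day one node can still legitimately hold 24 of those shards, one from each index in the series.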

I'm looking for a better solution. Re-indexing the hourly indices into daily ones, or daily into weekly, would help, but it is not really an option for us at the moment: re-indexing is very expensive and our current infrastructure does not really allow for it. Our smallest shards are already as big as we can make them, and our larger shards are not even up to the size most people online seem to aim for. That said, we could increase the shard counts on the larger indices even further to bring the size variance down.

What I would really like is a way to configure balancing rules across multiple indices (or an alias), to be able to say: "Between index x, y, and z there can only be one shard per node."

One approach is to have multiple clusters to avoid mixing different shard sizes. Another option is to effectively partition your cluster into virtual subclusters using shard allocation filtering, by assigning indices with small shards to certain nodes and indices with large shards to other nodes.
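
A minimal sketch of the filtering approach, assuming nodes are tagged with a custom attribute in elasticsearch.yml (for example node.attr.shard_profile: large or small; the attribute and index names are just placeholders):

```
PUT big-hourly-index/_settings
{
  "index.routing.allocation.require.shard_profile": "large"
}

PUT small-daily-index/_settings
{
  "index.routing.allocation.require.shard_profile": "small"
}
```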


We are thinking about a similar feature; see Suboptimal shard allocation with imbalanced shards · Issue #17213 · elastic/elasticsearch · GitHub.
Do you think that feature would solve your problem?

Thanks for the suggestions. We might experiment with manually segmenting the data as you suggested, but it does take away some of the benefits of having a distributed solution like ES. Our biggest shards usually end up being the ones with the highest ingestion pressure and the most expensive queries, and having every node at our disposal to help distribute that load is ideal.

The shard_weight/shard_resource_cost idea that you linked to does sound like a partial solution, although I'm not sure why the comment there wants to steer away from using it for balancing. My problem (and the other issues linked in that thread) essentially boils down to the cluster not understanding how to properly balance itself, because it only gives you two balancing heuristics to configure.
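
If I understand the balanced shards allocator correctly, the weight it tries to minimize is roughly this (going from memory of the docs):

```
weight(node, index) = indexBalance * (node.numShards(index) - avgShardsPerNode(index))
                    + shardBalance * (node.numShards - avgShardsPerNode)
```

Only shard counts appear in it; shard size and write activity do not.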

Fifty 1 GB shards are not the same as one 50 GB shard (especially if shard size is mostly ignored when balancing), and a shard that is actively being written to is very different from one that is read_only.

The suggested fix does fall short in one area though: I would still not have a way to get ES to treat shards from a time series of hourly indices as if they were from the same index (which they essentially are when it comes to querying). As long as I only have one or two index series whose shards have a high "resource cost", it would work fairly well, but beyond that you could still end up with one node taking on more indices from a single series than makes sense. The reality is that I am going to query a week's worth of these indices at once, and I don't want an individual node being responsible for a disproportionate amount of that work.
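
In other words, every query we run looks something like this (the index pattern and field are hypothetical), fanning out across a week of hourly indices at once:

```
GET logs-2020.06.*/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "now-7d/d" } }
  }
}
```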
