Is the nodes load a criteria for shard allocation?

Hi,

I would like to know if Elasticsearch is taking the load of the nodes to decide how to reballance shards? I have seen cluster.routing.allocation.balance.write_load in the documentation, but I'm not sure how that works. I've read that the estimated write load is calculated for data streams only and is 0 otherwise, but I'm not using data_streams so is the write load not taking into consideration to rebalance for my case?

For context my actual configuration is a 12 nodes cluster in which I reindexed a lot of data in a week from an old cluster. My indices are timed because I store and continuously update timed data (so data stream is not an option), so I have one indice for a month of data (name-2024.01.01 etc).
Every passed months are no longer written but every future months are receiving lot of writings and I need to find a way to tell elasticsearch to allocate the shards of thoses indices more evenly. At the moment, I have some nodes that have only a few futures indices and lot of past so they are doing nothing (load is around 1) and other have a lot of futures indices so are doing a lot of work and have a big iowait (load > 10).
All the future indices have a tier_preference of hot and old indices are warm or cold for the really old ones (< 2022), so I expected elasticsearch to balance according to that even if all the nodes accept every tier but that does not seem to be the case.

Obvious solutions I think of:

  • Setup specialized data_tier nodes: I'm currently working on that, I will add some hard drive cheap nodes for my cold shards that are never updated and almost never queried and I expect that will allow a more even spread of the hot shards on the other "normal" nodes (my guess is that ES is balancing almost only considering the number of shards/node right now, no matter the tier). If that works I may consider adding hot only and warm only nodes to spread even more, but I'm looking for the most economic solution right now, the budget is tight.
  • Increase number of shards of the heavy writing indices: also considered that but I don't want to overshard either (besides I would need to split 50 indices, that would need to be read only which is not really possible given the times it would take without update).

I'll take any advise on what to do here and if there is away to prioritize the write_load as a rebalance criteria (I tried increasing the cluster.routing.allocation.balance.write_load but it did nothing).

Thanks!

That's correct.

You need to use data streams so ES can take account of the write load.

1 Like

Thanks for your confirmation!
In my case I cannot use data_streams because the data are not append only, existing data needs to be updated regulary. So I guess in my case I'm back to my 2 solutions?

Just for my knowledge, is there any particular reason the write_load condition cannot apply to regular indices? I suspect it's easier to estimate this for data_stream which are supposed to have constant write load accross rollovers. But if we can configure a minimum number of days of observation for specific indices lets say a week in my case, if a shard is heavely written for a week, elasticsearch could estimate the future write load to be like the past week and rellocate accordingly?

Yeah that's pretty much it, data streams have a predictable rollover pattern so it makes sense to estimate the write load of the new index based on the observed write load on the previous one. If you're using regular indices then basically anything can happen, so even if you're using it in a way that could do write load estimation properly, Elasticsearch can't tell that this is the case.

How many shards are you considering?

1 Like

For now, each monthly index have 2 shards & 1 replica for now, the primary size is between 40 and 50Gb. The indexing rate (/s) for each of those indices can vary between a few hundreds to a thousand or more docs/s (for primary shards, as seen in the Kibana overview) and there is ~50 of indices concerned (2 templates concerned each with one index per month for around 2 years).

I was going to try 6 shards with 1 replica and eventually force the total_shards_per_node to one to be sure all of the 12 nodes only have 1 shard of each write heavy indices. But honesly I'm going blind here and do not really know what would be better so I'll take any advice you have.
In any case, I'm a bit afraid of splitting the ~50 indices concerned because I cannot really stop the update for more than 24h (this is a QA cluster which should go prod asap and need to be up to date).

They are all manager by ilm so they will be shrinked and forcemerges as soon as there is no longer any update.

For the record, I just added 2 cold nodes that can store all the cold data (which all have 1 shard & 1 replica) so the rellocation is in progress. I expect to see the hot and warm shards to be spread more evenly on the cold/warm nodes without the cold in the balance but the rellocate might take the whole day.

12 shards per index * 50 indices = 600 shards, that's not even close to anything I'd call "oversharded" in a modern ES cluster. We have benchmarks running at 50000 shards, and know of some clusters running with many times that.

IMO this sounds like a sensible plan.

Why not apply the change next month (e.g. tomorrow) when you create the new indices? That way, no need to split any indices which are seeing active write traffic.

1 Like

Thanks!

Just to be sure, you mean what I suggested so 6 shard per index + 1 replica (12 total) and a total_shards_per_node to 1 seems good for a 12 hot nodes cluster?
I guess I'll only know if I try :slight_smile:

Just to be sure, increasing the number of shard to this value which is low according to what you said should not have a bad impact on indexing? Could it impact search? Those hot indices are also going to be handling live requests when this go in production.

Yes applying the change to the template for the future indices is already done but that would give us a balanced cluster in 2 years from now as we create one index per month for 2 years in the future.
I might not have the choice and split to get the best performances but I'll probably be able to do that index by index and reactivate the update for a few hours in between so everything is still up to date. The indices are not that huge so it would probably not take to long for 1 index to split, I'll give it a try.

Do you need to do anything with past indices? I thought you said they don't see any write traffic.

1 Like

No past indices will just have to handle reads, no write once the date is passed.

To clarify a bit and simplify, I have to update data in the future 2 years which are splitted into 1 index by future month. So today I already have hot indices (that are writing) up to february 2026. The index that is going to be created next month is for the data related to march 2026, and so on each month. But the february 2024 index will only stop handling updates when the month is complete, so on 1 March 2024.
So I actually have 25 indices that are writing with the 2 shards configuration I had, and only one will become warm every month (and only one new hot with the new sharding configuration will be added), so that's why it would take 2 years to have every hot indices with the new sharding if I do not split.
Past indices today are all indices for months of 2023 and before.

Just to get back to that before doing anything I want to be sure, any reason why I souldn't go for even more shards per index? The more shards I have, the more probability there is that documents are not going to be routed to the same shard (and make a particular node heavy loaded while other are not) right?
So I could just go from 2 to 20 shards or even more, or is there a drawback I should take into account?

Ah, I see now. Thanks for clarifying.

I think I'd go ahead with splitting these future indices then. The split API requires the index to be read-only, but should be pretty quick so you won't need to block writes for very long for each index.

In general there's not much advantage of having more shards than nodes, especially for write traffic, and shards carry some overhead too.

1 Like

Thanks a lot for your help!

1 Like

Just a quick update, I actually cannot split from 2 to 6.

By the way I think the documentation is not consistent about that, the routing_shards setting is explained but the following is also written:

The number of primary shards in the target index must be a multiple of the number of primary shards in the source index.

But it actually depends on the number of routing_shards, which looks to be 1024:

the number of routing shards [1024] must be a multiple of the target shards [6]

So I might need to go to 8 but then I won't have a perfectly even number of shards on each node which was the goal.

I'm now considering to reindex everything from scratch, it would take a week but might be better to have exactly 6 shard + 1 replicas.

Ah damn I missed that, sorry.

I'd suggest a _shrink to 1 shard and then a _split to 6, which should be a lot quicker than reindexing.

1 Like

Oh yeah I didn't think about that, thank you very much!

Sorry I'm facing another issue, shrink require every shards to be on the same node to proceed. However this is not the case for any of my 50 indices, so I would need to rellocate them manually to the same node before proceeding.
Wouldn't be faster to use a reindex in this case? The shrink then split solution looked great but if I have to manually move everything it's going to be a problem...

Yes, shrinking to one shard requires the shards to be on the same node indeed, maybe I'm missing something but I would expect that to be easier than reindex still. It's all up to you of course, if you're more comfortable with reindexing then go for it.

Well it's easy for a couple of indices but for 50 I would need to create a script to do that automatically or do it manually 50 times but it would take a lot of time. Unless there is a way to force elasticsearch to do that other than manually rellocating the shards for each indices?
If not and if reindex does not take much longer to proceed than doing all these, it might be preferable. I guess I'll try both on two indices and see if there is a huge difference.