Balance Nodes by CPU Usage

Is there a way to balance the nodes by CPU instead of shard count?
I'm constantly seeing 1-3 of our hot nodes sitting above 60% CPU usage while the other 6 hot nodes are around 10%.
They all have the same number of shards on them, but this leads to degraded performance. Sometimes the hotter nodes reach 100% CPU, which leads to queueing for logs.
I've seen some older posts about this issue but haven't seen a real resolution.
Is there a setting to adjust?
Is there a way to set a hard limit on CPU so it won't assign any more shards, or even move shards off these hotter nodes?

No, this is not possible.

Elasticsearch will try to balance the shards equally across the nodes in the same data tier.

You need to try to troubleshoot the cause of this high CPU usage.

For example, how are you sending data to your Elasticsearch? Are you load balancing the requests between all your nodes?
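One way to check this is to compare the per-node indexing counters; the filter_path parameter below just trims the response (a sketch, not the only way to check):

GET _nodes/stats/indices/indexing?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total

If one node's index_total grows much faster than the others, the clients are probably not spreading their requests evenly.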

What are the specs of the nodes? Do they all have the same specs? Are all the nodes configured as ingest nodes as well?
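As a quick check, the cat nodes API can list each node's roles alongside its current CPU (column names taken from the cat nodes documentation):

GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent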

Also, what is the output of the hot threads API for the node with high CPU usage?

Why not?
Is it a technical constraint that Elasticsearch cannot be balanced by resource usage?
We have multiple teams that use this Elastic stack.
Should we use a single Elastic stack per team so that when it balances by shard the resource usage isn't skewed?

Because Elasticsearch does not support it.

For Elasticsearch, a cluster is considered balanced when the nodes in a data tier each hold the same number of shards.

If you have 9 hot data nodes and 900 shards, a balanced cluster will have 100 shards on each node.

There are some settings that can make Elasticsearch give more weight to the write load of each shard when balancing the cluster, but this is not the same as balancing based on CPU usage.
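For reference, the write-load weight is controlled by a cluster setting along these lines (the setting name and its availability vary by version and license, and 12.0 is only an illustrative value, not a recommendation):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.write_load": 12.0
  }
}

Raising the weight makes the balancer care more about estimated write load relative to shard count and disk size.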

You can read more about what counts towards rebalancing in the documentation.

Not sure; you didn't provide enough context about how you are using Elasticsearch. Your issue may be unrelated to the number of shards on each node.

As I said, you need to troubleshoot why just a couple of nodes have high CPU usage, and for this you would need to provide more context about how you use Elasticsearch, how your cluster is configured, etc.

What a poor app.

How many primary shards are there for the indices being written to on those nodes?
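You can see how the primaries of a given index are spread with the cat shards API (my-index is a placeholder for your own index name; prirep is p for primaries and r for replicas):

GET _cat/shards/my-index?v&h=index,shard,prirep,node,store

If all the primaries of a busy index land on the same few nodes, those nodes will do most of the indexing work.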

It is also possible that there is something else going on... Have you run the following to see what is actually going on?

GET _nodes/hot_threads

CPU balancing is accomplished by balancing the reads/writes across the nodes... but Elasticsearch is a distributed data store / search engine, so the reads and writes need to go where the data is; distributing the data is key to distributing the load/CPU.

Elasticsearch is not a stateless service where the load is simply round-robinned...

Also, I have seen other cases where the clients read and write data with "sticky sessions" to particular nodes...
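One way to spot a sticky client is to compare the HTTP connection counters per node; a node with far more opened connections than its peers is likely the one the clients have pinned (a sketch using the node stats API):

GET _nodes/stats/http?filter_path=nodes.*.name,nodes.*.http.total_opened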

There could be many reasons...

As already said, there are many things that can lead to high CPU usage, like misconfiguration on the server side, misconfiguration on the client side, under-specced nodes, slow hardware and many others.

There are also many ways to improve Elasticsearch performance.

You didn't provide any context about your issue or your use case; if you want further help troubleshooting this, you need to provide more context.

It is impossible to know what the issue may be without more information.


This isn't true in recent versions. Indeed, the very docs you linked disagree:

The weight of a node depends on the number of shards it holds and on the total estimated resource usage of those shards expressed in terms of the size of the shard on disk and the number of threads needed to support write traffic to the shard.

"Number of threads" is effectively CPU usage, although note that this only applies to indexing load today. Searches are harder to balance, but in practice search load tends to be reasonably well balanced when we balance the shard count and the size on disk. However the OP's problem seems to be indexing load, so that should work for them.


Yeah, I mentioned that the write load is one of the factors in rebalancing, but I did not understand this as the same thing as balancing based on CPU usage.

There are many factors that can cause high CPU usage, like indexing load, search load, or maybe some heavy ingest pipeline processors, but if Elasticsearch considers only the indexing load, then I do not see this exactly as balancing by CPU usage: it will not start moving shards if, for some reason, the CPU of a node stays high for a specific period of time. Or will it? I have never seen this on my cluster.

But without more context from the OP, it is not possible to know the cause of the high CPU.

A quick question about this: according to the documentation, the write_load heuristic only works if you are using data streams.

Shards that do not belong to the write index of a data stream have an estimated write load of zero.

It also seems to only work with an Enterprise license according to the subscription page, is that right?

Yes, correct on both counts.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.