I'm adding a new data node to an ES cluster running version 6.8.21; I can't upgrade because of other related dependencies.
When the new data node joins, the rerouting of shards onto it works fine for around 30-50 minutes, then the server gets very high CPU and the cluster as a whole just stops/hangs on reallocating shards. So I have to remove the node after around 1 hour, after which the cluster works fine again.
We have 10 master nodes and 90 data nodes holding 600 TB of data, with around 300 shards per data node; each node has 72 GB of RAM, a 32 GB heap, and 36 CPUs.
I previously added 20 other data nodes and had no issues with them; they were deployed with exactly the same config, the same hardware, and the same OS (RHEL 7.9). I can't upgrade the OS either.
We have done a lot of investigation into possible hardware/OS/SAN disk issues but can't find anything wrong there.
Has anyone had similar issues when adding a new node? Or any hints on what more I can check?
the server gets very high CPU and the cluster as a whole just stops/hangs on reallocating shards
Just to be clear: the new data node shows the very high CPU after 30-50 minutes, but the other 90 are OK? And that one node's issues effectively hang the whole 101-node cluster until it leaves and the cluster becomes a 100-node cluster again?
I presume you do want some shards re-allocated to it; that's why you're adding it as a data node?
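For example (just a sketch, not something I've run against your cluster; the host and node name below are placeholders, and any auth headers are omitted), a quick way to see whether shards are actually landing on it:

```python
# Rough sketch (untested; host/node names are placeholders) of watching whether
# the new node is actually picking up shards, via the _cat APIs.
import requests

ES = "http://localhost:9200"   # any node's HTTP address (placeholder)
NEW_NODE = "data-node-91"      # name of the new data node (placeholder)

# Shard count and disk usage per node; the new node's shard count should be growing
print(requests.get(f"{ES}/_cat/allocation?v&h=node,shards,disk.used,disk.percent").text)

# Shards currently relocating, plus anything already sitting on the new node
shards = requests.get(f"{ES}/_cat/shards?v&h=index,shard,prirep,state,node").text
print("\n".join(line for line in shards.splitlines()
                if "RELOCATING" in line or NEW_NODE in line))
```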
Though it's not really the topic of your question, is a 91-data-node cluster going to be a game changer compared to a 90-data-node cluster? As you've noted, both RHEL and Elasticsearch are pretty old, both around 5 years, so maybe better not to tinker with it too much?
Yes, everything else works fine, including the last 20 nodes I added as data nodes; it's just this one that doesn't.
Yes, I know we're on an outdated version, but updating/migrating isn't possible because of other dependencies.
Well, it would add some disk capacity that can take shards, so we can be sure we're fine until we can build a new environment with new hardware/OS.
We get around 15 TB of logs per day into this cluster,
so we want to extend it as much as we can until we move to the new environment.
But as you say, maybe just let it be, since it works well as it is now.
Its a "kick a problem down the road" strategy, we all do it.
And is it 90 --> 91 and you are done. Or 91 will become 92 and ... on an regular, ongoing basis? If its 91 and done for now, I'd just define reaching 90 as victory
Again, just for clarity, are we talking physical servers, stacked somewhere in a data centre, here? Or some kind of servers within a virtual environment?
I can think of a few possibilities:
You just hit a limit somewhere (and everything has a limit), and you happened to hit it at 90 ---> 91. What that limit is, is harder to define; it could be in the environment, the network, the SAN, wherever (one way to start narrowing that down is sketched below).
Though you think not, there is some subtle difference between server 91 and the other 90. Maybe its SAN HBA is configured wrong, or its network adapter, or its BIOS settings aren't right, or ...
That version of Elasticsearch has a subtle bug which you've just hit. I'd rate that as unlikely.
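If it is some limit being hit, it might help to capture what the cluster is actually doing in the 30-50 minutes before things stall. A rough sketch of the kind of snapshot I'd take (the address and node name are placeholders, and there may be better stats to watch):

```python
# Rough sketch (placeholder host/node name; auth omitted): snapshot the things that
# usually show where the time is going while the new node is still behaving.
import requests

ES = "http://localhost:9200"   # any reachable node's HTTP address (placeholder)
NEW_NODE = "data-node-91"      # name of the new data node (placeholder)

# Active shard recoveries: how many, at what stage, and between which nodes
print(requests.get(f"{ES}/_cat/recovery?v&active_only=true"
                   "&h=index,shard,time,stage,source_node,target_node,bytes_percent").text)

# Cluster-level pending tasks: a long queue here often explains a "hung" cluster
print(requests.get(f"{ES}/_cluster/pending_tasks").json())

# Hot threads on the new node: what the high CPU is actually being spent on
print(requests.get(f"{ES}/_nodes/{NEW_NODE}/hot_threads").text)
```

Running that every few minutes and comparing the output from just before and just after the stall should at least show whether recoveries stop, pending tasks pile up, or the CPU is burning in something unexpected.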
I believe you can set the new node so that it's not eligible for "old" shards to be auto-reallocated to it, and it's only used for new shards/indices? I don't recall the specific setting. I do vaguely remember, long ago, using a plugin (was it kopf?) to move some shards around manually at night, when our cluster was at its quietest, after we hit some not dissimilar issues.
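If the setting I'm half-remembering is shard allocation filtering, then something roughly along these lines might do it; the index patterns and node names below are made up, so treat this as a sketch rather than a recipe:

```python
# Rough sketch of the two ideas above (placeholder names; auth omitted; untested on your cluster):
#  1) exclude the new node from existing ("old") indices so only new indices land on it
#  2) move a shard onto it manually, one at a time, at a quiet hour
import requests

ES = "http://localhost:9200"   # any reachable node's HTTP address (placeholder)
NEW_NODE = "data-node-91"      # placeholder name of the new data node

# 1) Per-index shard allocation filtering: these indices won't relocate to the new node.
#    Indices created afterwards (without this setting in their template) still can.
requests.put(
    f"{ES}/old-index-*/_settings",                       # placeholder index pattern
    json={"index.routing.allocation.exclude._name": NEW_NODE},
)

# 2) Manual shard move via the reroute API (one shard of one index, as an example)
requests.post(
    f"{ES}/_cluster/reroute",
    json={"commands": [{"move": {
        "index": "logs-2021.11.01",                      # placeholder index
        "shard": 0,
        "from_node": "data-node-17",                     # placeholder source node
        "to_node": NEW_NODE,
    }}]},
)
```

The per-index exclude should stop existing indices from relocating onto the new node while still letting newly created indices use it (as long as the exclude isn't baked into an index template), and the reroute move command is, as far as I remember, roughly what plugins like kopf did under the hood when you dragged shards around by hand.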