I'm adding a new data node to an ES cluster running version 6.8.21; I can't upgrade because of other related dependencies.
When the new data node joins, the rerouting of shards onto it works fine for around 30-50 minutes, then the server gets very high CPU and the cluster as a whole just stops/hangs on reallocating shards. So I have to remove the node after around 1 hour, after which the cluster works fine again.
We have 10 master nodes and 90 data nodes holding 600 TB of data, with around 300 shards per data node; each node has 72 GB of RAM, a 32 GB heap, and 36 CPUs.
I previously added 20 other data nodes and had no issues with them; they were deployed with exactly the same config, the same hardware, and the same OS (RHEL 7.9). I can't upgrade the OS either.
We have done a lot of investigation into possible hardware/OS/SAN disk issues but can't find anything wrong there.
Has anyone had similar issues when adding a new node? Or any hints on what more I can check?
the server gets very high CPU and the cluster as a whole just stops/hangs on reallocating shards
Just to be clear: the new data node shows the very high CPU after 30-50 minutes, but the other 90 are OK? And that one node's issues effectively hang the whole 101-node cluster until it leaves and the cluster becomes a 100-node cluster again?
I presume you do want some shards re-allocated to it; that's why you're adding it as a data node?
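For example (just a sketch, not something I've run against your cluster; the host and node name below are placeholders, and any auth headers are omitted), a quick way to see whether shards are actually landing on it:

```python
# Rough sketch (untested; host/node names are placeholders) of watching whether
# the new node is actually picking up shards, via the _cat APIs.
import requests

ES = "http://localhost:9200"   # any node's HTTP address (placeholder)
NEW_NODE = "data-node-91"      # name of the new data node (placeholder)

# Shard count and disk usage per node; the new node's shard count should be growing
print(requests.get(f"{ES}/_cat/allocation?v&h=node,shards,disk.used,disk.percent").text)

# Shards currently relocating, plus anything already sitting on the new node
shards = requests.get(f"{ES}/_cat/shards?v&h=index,shard,prirep,state,node").text
print("\n".join(line for line in shards.splitlines()
                if "RELOCATING" in line or NEW_NODE in line))
```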
Though it's not really the topic of your question, is a 91-data-node cluster going to be a game changer compared to a 90-data-node cluster? As you've noted, both RHEL and Elasticsearch are pretty old, both around 5 years, so maybe better not to tinker with it too much?
Yes, everything else works fine, including the last 20 nodes I added as data nodes; it's just this one that doesn't.
Yes, I know we're on an outdated version, but updating/migrating isn't possible because of other dependencies.
Well, it would add some disk capacity that can take shards, so we can be sure we're fine until we can build a new environment with new hardware/OS.
We get around 15 TB of logs per day into this cluster,
so we want to extend it as much as we can until we move to the new environment.
But as you say, maybe just let it be, since it works well as it is now.
Its a "kick a problem down the road" strategy, we all do it.
And is it 90 --> 91 and you are done. Or 91 will become 92 and ... on an regular, ongoing basis? If its 91 and done for now, I'd just define reaching 90 as victory
Again, just for clarity, are we talking physical servers, stacked somewhere in a data centre, here? Or some kind of servers within a virtual environment?
I can think of a few possibilities:
You just hit a limit somewhere (and everything has a limit), and you happened to hit it at 90 ---> 91. What that limit is, is harder to define; it could be in the environment, the network, the SAN, wherever (one way to start narrowing that down is sketched below).
Though you think not, there is some subtle difference between server 91 and the other 90. Maybe its SAN HBA is configured wrong, or its network adapter, or its BIOS settings aren't right, or ...
That version of Elasticsearch has a subtle bug which you've just hit. I'd rate that as unlikely.
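If it is some limit being hit, it might help to capture what the cluster is actually doing in the 30-50 minutes before things stall. A rough sketch of the kind of snapshot I'd take (the address and node name are placeholders, and there may be better stats to watch):

```python
# Rough sketch (placeholder host/node name; auth omitted): snapshot the things that
# usually show where the time is going while the new node is still behaving.
import requests

ES = "http://localhost:9200"   # any reachable node's HTTP address (placeholder)
NEW_NODE = "data-node-91"      # name of the new data node (placeholder)

# Active shard recoveries: how many, at what stage, and between which nodes
print(requests.get(f"{ES}/_cat/recovery?v&active_only=true"
                   "&h=index,shard,time,stage,source_node,target_node,bytes_percent").text)

# Cluster-level pending tasks: a long queue here often explains a "hung" cluster
print(requests.get(f"{ES}/_cluster/pending_tasks").json())

# Hot threads on the new node: what the high CPU is actually being spent on
print(requests.get(f"{ES}/_nodes/{NEW_NODE}/hot_threads").text)
```

Running that every few minutes and comparing the output from just before and just after the stall should at least show whether recoveries stop, pending tasks pile up, or the CPU is burning in something unexpected.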
I believe you can set the new node so that it's not eligible for "old" shards to be auto-reallocated to it, and it's only used for new shards/indices? I don't recall the specific setting. I do vaguely remember, long ago, using a plugin (was it kopf?) to move some shards around manually at night, when our cluster was at its quietest, after we hit some not dissimilar issues.
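If the setting I'm half-remembering is shard allocation filtering, then something roughly along these lines might do it; the index patterns and node names below are made up, so treat this as a sketch rather than a recipe:

```python
# Rough sketch of the two ideas above (placeholder names; auth omitted; untested on your cluster):
#  1) exclude the new node from existing ("old") indices so only new indices land on it
#  2) move a shard onto it manually, one at a time, at a quiet hour
import requests

ES = "http://localhost:9200"   # any reachable node's HTTP address (placeholder)
NEW_NODE = "data-node-91"      # placeholder name of the new data node

# 1) Per-index shard allocation filtering: these indices won't relocate to the new node.
#    Indices created afterwards (without this setting in their template) still can.
requests.put(
    f"{ES}/old-index-*/_settings",                       # placeholder index pattern
    json={"index.routing.allocation.exclude._name": NEW_NODE},
)

# 2) Manual shard move via the reroute API (one shard of one index, as an example)
requests.post(
    f"{ES}/_cluster/reroute",
    json={"commands": [{"move": {
        "index": "logs-2021.11.01",                      # placeholder index
        "shard": 0,
        "from_node": "data-node-17",                     # placeholder source node
        "to_node": NEW_NODE,
    }}]},
)
```

The per-index exclude should stop existing indices from relocating onto the new node while still letting newly created indices use it (as long as the exclude isn't baked into an index template), and the reroute move command is, as far as I remember, roughly what plugins like kopf did under the hood when you dragged shards around by hand.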