I currently have an Elasticsearch cluster (7.15.1) with 3 master nodes, 6 data nodes, and 2 ingest nodes, deployed on Kubernetes in AWS. Each data node runs on an m5.2xlarge instance with a 1.5TB storage volume. (Memory on the ES container is set to 28GB, the heap is 50% of that, and each node uses 7 of the 8 cores.)
The largest Filebeat log source has an index template with 3 primary shards + 1 replica and a 15-second refresh interval, plus an ILM policy that rolls over at 20GB per primary shard or 1 day and deletes indices after 7 days.
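Roughly, the ILM policy and template look like this in Dev Tools (the policy name, template name, and the filebeat-logs-* pattern are placeholders rather than my exact names):

```
PUT _ilm/policy/filebeat-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "20gb",
            "max_age": "1d"
          }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

PUT _index_template/filebeat-logs
{
  "index_patterns": ["filebeat-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "15s",
      "index.lifecycle.name": "filebeat-logs-policy",
      "index.lifecycle.rollover_alias": "filebeat-logs"
    }
  }
}
```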
For the most part, shards are spread evenly across all data nodes, but CPU on the first data node is almost always over 90%, while every other data node stays at 25% or less. This seems to cause (or at least be related to) a backlog where incoming logs take a long time to actually be written to Elasticsearch.
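For context, this is the kind of check that shows the imbalance: per-node CPU and load from the cat APIs, plus the write thread pool queue that backs up when ingestion falls behind.

```
# Per-node CPU, load, heap, and disk, busiest first
GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent,disk.used_percent&s=cpu:desc

# Write thread pool activity, queue, and rejections per node
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected&s=queue:desc
```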
If I run a hot threads check, it's either a Lucene merge, refresh, or write thread using anywhere from 50-75% of the CPU for a single thread.
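The hot threads call I use looks like this (the node name in the second request is just a placeholder for the busy node):

```
# Top threads across all nodes, sampled over 500ms
GET _nodes/hot_threads?threads=3&interval=500ms

# Or only the busy node; "data-node-1" is a placeholder
GET _nodes/data-node-1/hot_threads?threads=3
```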
Everything is load balanced, and shards appear to be equally distributed among all data nodes.
My question is: why would a single data node always run so much hotter than the others? There is nothing specific this data node should be doing that shouldn't be shared or spread across the others.
The setup sounds sane and good from the description. If the shards are distributed evenly, all the nodes basically have to do the same work of merging, refreshing, and writing.
Are the sizes of each shard also the same? Just to be sure that one node is not dealing with an abnormal amount of data.
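Something like the following should make that obvious: shard sizes sorted largest first, and the per-node totals.

```
# Largest shards first, with the node each one lives on
GET _cat/shards?v&h=index,shard,prirep,store,docs,node&s=store:desc

# Per-node shard count and index data on disk
GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.percent&s=disk.indices:desc
```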
This is why it gets confusing. The data should be written to all 6 data nodes, and Dev Tools shows me the shards being distributed across them. I'm not sure why one node would end up with an abnormal amount of data.
They all show nearly the same total storage used. It's just that one single data node keeps sitting at 90% CPU when the others do not.
No, I am not. They all route to an AWS load balancer that has all instances (non-sticky) in its target group. I know I don't NEED a load balancer, since the cluster routes requests internally anyway, but that's just how it is set up. So no, traffic isn't going to a single node only; it only appears that way CPU-wise.
So I have done several things to expand the cluster: I increased the shard count, added warm nodes, and increased the CPU available to the hot nodes by doubling the instance size.
What I keep seeing is a single data node sitting at around 90% CPU usage. There are 6 hot data nodes, but this specific one holds almost twice as much data as any of the others, and I don't know why. I haven't touched any of the default settings as far as rebalancing goes, and most of my smaller indices are 1 primary + 1 replica shard. It's the large cluster sending logs here that takes up all of the space, so it really comes down to a single multi-sharded index somehow taking up the space and not getting balanced properly.
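(By "default settings" I mean the allocator/balancer settings; dumping the cluster settings with defaults included is where the cluster.routing.allocation.balance.* values and disk watermarks show up, and they are all still at their defaults.)

```
GET _cluster/settings?include_defaults=true&flat_settings=true
```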
Of course, I am not sure whether one has anything to do with the other (the high CPU may not be caused by the larger amount of data stored), but I can't find anything other than write and Lucene merge activity showing up in the hot threads.
Along with that, this one node has a much higher document count, CPU load, IOPS, and segment count. Maybe it mostly holds shards of the large index, and allocation is being balanced on total shards for the cluster rather than on that one huge index.
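If that's what's happening, one lever I could try is index.routing.allocation.total_shards_per_node on the big index so its shards are forced apart. A rough sketch (the filebeat-logs-* pattern and the cap of 2 are placeholders; setting the cap too low can leave shards unassigned):

```
# Cap how many shard copies of these indices may sit on any one node (dynamic setting)
PUT filebeat-logs-*/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}
```

Adding the same setting to the index template would make each new rollover index inherit it.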