I have an ECK cluster with 6 data/ingest nodes. I also have an ingest pipeline deployed that processes logs from Filebeat. While it is running, I noticed that only one node is utilized at near 100% CPU while the others hover around 25%.
When I dig into the issue further with the following command, it clearly shows that only that one node is doing the pipeline processing.
GET _nodes/stats/ingest?filter_path=nodes.*.ingest
From this information, it seems the ingest load is not being spread across the nodes appropriately.
Is there a default configuration in our Helm chart that we missed? Or is there a setting as part of the ingest pipeline that we are supposed to configure?
The list of Elasticsearch nodes to connect to. The events are distributed to these nodes in round robin order. If one node becomes unreachable, the event is automatically sent to another node. Each Elasticsearch node can be defined as a URL or IP:PORT. For example: http://192.15.3.2, https://es.found.io:9230 or 192.24.3.2:9300. If no port is specified, 9200 is used.
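For illustration only, a Filebeat output that lists each ingest node individually (the hostnames below are made up; in an ECK setup Filebeat is often pointed at a single -es-http Service URL instead, in which case balancing depends on how that Service spreads connections) might look like this:

output.elasticsearch:
  # Hypothetical per-node endpoints; Filebeat round-robins events across this list.
  hosts:
    - "https://es-ingest-0.example.internal:9200"
    - "https://es-ingest-1.example.internal:9200"
    - "https://es-ingest-2.example.internal:9200"
  # worker controls the number of concurrent connections per host (default is 1).
  worker: 2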
And can you see whether the traffic from the ingress load balancer is evenly distributed?
I ask because you can set round robin... but if the balancing is per connection and there is one main connection that is stuck to a single node, all the load lands there... I have seen that happen.
Sorry for the late response. We have put our ingest nodes on dedicated nodes, which helped -- this led to an even CPU spread across the availability zones.
But we are still seeing unevenness with our data nodes... basically our a-0 node is always over 85% while the other nodes are in the low teens. I ran hot threads and got some of the following results (sorry, I can't copy/paste them in here):
68.5% (342.7ms out of 500ms) cpu usage by thread 'Elasticsearch[eck-Elasticsearch-es-data-zone-a-0][[filebeat-000839][0]: Lucene Merge Thread #594]'
68.5% (339.1ms out of 500ms) cpu usage by thread 'Elasticsearch[eck-Elasticsearch-es-data-zone-a-0][write][T#1]'
64.0% (320.1ms out of 500ms) cpu usage by thread 'Elasticsearch[eck-Elasticsearch-es-data-zone-a-0][write][T#3]'
This means that a merge / force merge is taking place on that node... if your ILM policies require force merges, that could be the source... merges happen on the data node where the data resides.
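To check, you can look at where ILM thinks that index is in its lifecycle and whether the attached policy includes a forcemerge action (the index name here is taken from your hot threads output; your policy name and phases will differ):

GET filebeat-000839/_ilm/explain
GET _ilm/policy

A forcemerge in a policy typically sits in the warm phase and looks roughly like:

"warm": {
  "actions": {
    "forcemerge": { "max_num_segments": 1 }
  }
}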
Otherwise it looks like there are a lot of writes... if there is a single primary shard, then the writes are all happening on that node...
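You can confirm how the primaries are laid out (the filebeat-* pattern is just matching your index names) with something like:

GET _cat/shards/filebeat-*?v&h=index,shard,prirep,store,node&s=index

If each write index has a single primary and it keeps landing on a-0, increasing number_of_shards in the index template is the usual way to spread the write load across data nodes.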
You can look at the thread pool stats as well.
Otherwise I do not have many more suggestions; perhaps someone else will.
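For example, the write thread pool per node shows active threads, queue depth and rejections (a minimal sketch; adjust the columns as needed):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected&s=node_name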