Elastic ingest node load not balanced with pipeline

Hi,

I have an ECK cluster with 6 data/ingest nodes. I also have an ingest pipeline deployed that processes logs from Filebeat. While it is running, I noticed that only one node is utilized near 100% CPU while the others hover around 25%.

When I dig into the issue with the following command, it clearly shows that only one node is doing the pipeline processing.

GET _nodes/stats/ingest?filter_path=nodes.*.ingest

From this information, it seems the ingest load is not being spread across the nodes appropriately.
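
For reference, here is the same stats call narrowed down to the per-node totals, which makes the comparison easier to read (the filter_path below is just one way to slice the response):

GET _nodes/stats/ingest?filter_path=nodes.*.name,nodes.*.ingest.total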

Is there a configuration in our Helm chart that we missed? Is there a setting we are supposed to apply as part of the ingest pipeline itself?

Thanks,
Scott

Did you try using an array of hosts in the Filebeat elasticsearch output?

See the hosts setting in the Filebeat Elasticsearch output docs:

The list of Elasticsearch nodes to connect to. The events are distributed to these nodes in round robin order. If one node becomes unreachable, the event is automatically sent to another node. Each Elasticsearch node can be defined as a URL or IP:PORT. For example: http://192.15.3.2, https://es.found.io:9230 or 192.24.3.2:9300. If no port is specified, 9200 is used.
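
In filebeat.yml that would look roughly like this (the hostnames are placeholders for your actual ingest nodes):

output.elasticsearch:
  hosts: ["https://es-ingest-0:9200", "https://es-ingest-1:9200", "https://es-ingest-2:9200"]

With a list like this, Filebeat round-robins events across the nodes itself instead of relying on whatever sits in front of them.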

Hi Stephen,

We use a load-balancer / ingress in front of our Elasticsearch nodes. We list that as our host endpoint.

Thanks,
Scott

Then it could be a couple of things; the first that come to mind:

  1. Your load balancer is "sticky" instead of round robin.

  2. You only have 1 primary shard, so the node where that primary shard resides is the hot node. You can check where the primaries live with _cat/shards, as sketched below.
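
This shows which node holds each shard of the Filebeat indices (the index pattern is an assumption; adjust it to your naming):

GET _cat/shards/filebeat-*?v&h=index,shard,prirep,state,node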

We have ensured that the load balancer is not sticky and have 3 primary shards (1 replica) spread across the 6 data nodes.

Any other ideas?

Did you run Hot Threads on that node to see what it is doing?
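
For example, something like this (the node name is a placeholder for the busy node; threads=5 just widens the report a bit):

GET _nodes/<hot-node-name>/hot_threads?threads=5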

Yes, but if I remember correctly we just saw the pipeline processing we expected.

What should we be looking for? I'll re-run that API call

Whatever is taking up the most CPU....

And can you see that the traffic from the ingress load balancer is evenly distributed?

I ask because you can set round robin... but if the balancing is per connection and there is one main long-lived connection, it can get pinned to a single node... I have seen that.
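
If the balancing really is per connection, one mitigation you could sketch out is opening more connections from Filebeat so the load balancer has more to spread around; worker is a standard Filebeat output option, the ingress hostname is a placeholder and the value is only an example:

output.elasticsearch:
  hosts: ["https://my-ingress:9200"]
  worker: 4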

Hi Stephen,

Sorry for the late response. We have moved ingest onto dedicated ingest nodes, which helped: CPU is now spread evenly across the availability zones.

But we are still seeing the unevenness on our data nodes... basically our a-0 node is always at 85%+ while the other nodes are in the low teens. I ran hot threads and got results like the following (sorry, I can't copy/paste the full output in here):

68.5% (342.7ms out of 500ms) cpu usage by thread 'Elasticsearch[eck-Elasticsearch-es-data-zone-a-0][[filebeat-000839][0]: Lucene Merge Thread #594]'

68.5% (339.1ms out of 500ms) cpu usage by thread 'Elasticsearch[eck-Elasticsearch-es-data-zone-a-0][write][T#1]'

64.0% (320.1ms out of 500ms) cpu usage by thread 'Elasticsearch[eck-Elasticsearch-es-data-zone-a-0][write][T#3]'

This means that a merge / force merge is taking place on that node... if your ILM policy requires force merges, that could be the source. Merges happen on the data node where the data resides.
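
If you want to confirm that, you can check whether the ILM policy attached to those indices includes a forcemerge action, and which phase/action each index is currently in (the policy name below is a placeholder):

GET _ilm/policy/<your-filebeat-policy>
GET filebeat-*/_ilm/explain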

Otherwise it looks like there are a lot of writes... if there is a single primary shard, then the writes are all happening on the node that holds it...

You can look at the thread pool stats as well.
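
Assuming that means the per-node write thread pool, something like this shows active threads, queueing and rejections per node:

GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected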

Otherwise I do not have many more suggestions; perhaps someone else will.
