Performance bottleneck enriching documents because only a single node processes the ingest pipelines

We have a cluster of 4 nodes: A, B, C, and D. All four have all default roles assigned, including the ingest role. We also have a separate master node.

We have recently put new ingest pipelines into use that enrich the documents we send to Elasticsearch from one or more enrich indices. Whenever these pipelines run, only node B becomes very active processing them (verified using GET _nodes/B/hot_threads) while the other three nodes are not involved. Because node B alone cannot handle the load these pipelines require (millions of documents, totalling several hundred GB of data), it has become a huge performance bottleneck for us.
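For context, the pipelines and the way we invoke them look roughly like the sketch below, written with NEST; all policy, field, and index names here are placeholders rather than our real ones.

```csharp
using System.Collections.Generic;
using Nest;

public class Order
{
    public string CustomerId { get; set; }
    public decimal Amount { get; set; }
}

public static class EnrichPipelineSketch
{
    public static void Run(IElasticClient client, IEnumerable<Order> documents)
    {
        // An ingest pipeline with a single enrich processor that looks up each
        // document's "customer_id" against a (hypothetical) "customer-policy"
        // enrich policy and writes the match into a "customer" field.
        client.Ingest.PutPipeline("enrich-customers", p => p
            .Description("Enrich orders with customer details")
            .Processors(ps => ps
                .Enrich<Order>(e => e
                    .PolicyName("customer-policy")  // hypothetical enrich policy name
                    .Field("customer_id")           // match field on the incoming document
                    .TargetField("customer")        // where the enrich data ends up
                    .MaxMatches(1))));

        // Documents are then sent through the pipeline in bulk.
        client.Bulk(b => b
            .Index("orders")                        // hypothetical destination index
            .Pipeline("enrich-customers")
            .IndexMany(documents));
    }
}
```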

The index we are writing to has its primary shard on node A and a replica on node C, so nothing on node B. The various enrich indices have shards on all four nodes and their corresponding source indices have shards on nodes A and C or nodes A and D, so again nothing on node B.

What is going on? Why would node B, of all nodes, be the one that becomes active, and how could we fix this and spread the load?

How are you sending the data to Elasticsearch? Have you checked whether you are sending data to all nodes?

You are using the enrich processor, right?

Can you share your enrich policies?

Also, how are you sending your data to Elasticsearch, and what are the specs of your nodes?

The recommendation from the documentation is to use dedicated ingest nodes for heavy ingest loads.
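For reference, a dedicated ingest node is simply one whose only role is ingest; assuming Elasticsearch 7.9 or later, that would be configured in its elasticsearch.yml along these lines:

```yaml
# elasticsearch.yml on a dedicated ingest node:
# no data or master roles, so the node only runs ingest pipelines
node.roles: [ ingest ]
```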

It looks like that may well be the issue: the IP we are using to connect to Elasticsearch is precisely that of node B, not any of the other nodes.

How can I send data to all nodes? We are using .NET and NEST version 7.17.4. Is it a matter of supplying the URLs for each of the nodes to NEST's ElasticClient?
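Something roughly like this sketch, I assume? (The addresses are placeholders for our four nodes.)

```csharp
using System;
using Elasticsearch.Net;
using Nest;

public static class ClientSetupSketch
{
    public static ElasticClient Create()
    {
        // Every node the client should be able to send requests to
        // (placeholder addresses for nodes A, B, C and D).
        var nodes = new[]
        {
            new Uri("http://node-a:9200"),
            new Uri("http://node-b:9200"),
            new Uri("http://node-c:9200"),
            new Uri("http://node-d:9200"),
        };

        // A StaticConnectionPool round-robins requests across the listed nodes.
        // A SniffingConnectionPool could be used instead to discover the rest of
        // the cluster from one or more seed nodes and keep the node list current.
        var pool = new StaticConnectionPool(nodes);
        var settings = new ConnectionSettings(pool);

        return new ElasticClient(settings);
    }
}
```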

I am not a .NET developer, but that is typically how it works for most other clients.

It looks like this resolves the ingest being run on only a single node, but I am having difficulty verifying it properly, because we have had to temporarily stop using the ingest pipeline due to resource limitations in multiple places. According to the Elasticvue Chrome extension, our Elasticsearch cluster is very constrained in its RAM (but not its heap): all our nodes usually use 95-99% of the RAM they have, even when we are not using the ingest pipeline. Do you have any thoughts on how we might improve Elasticsearch's RAM usage?

It is always recommended to run Elasticsearch on its own dedicated nodes. Elasticsearch relies on the heap (typically no more than 50% of available RAM) but also stores some data off-heap. In addition to this it relies on the operating system page cache for performance. This can quickly use up all available RAM on a host, but that is expected and not a problem since any memory assigned to the page cache can quickly be reclaimed by the operating system if any other process should require it.

What you are describing therefore sounds normal and nothing to worry about, assuming page cache usage is included in your measurement.
