Ingest Node Problems

Hi,
I've just enabled an ingest node pipeline for a subset of the logs being ingested into one of our 5.6.3 clusters. The pipeline looks like this:

{
  "test" : {
    "description" : "Test Logging Pipeline",
    "processors" : [
      {
        "set" : {
          "field" : "pipeline",
          "value" : "test"
        }
      }
    ],
    "on_failure" : [
      {
        "set" : {
          "field" : "_index",
          "value" : "failed-{{ _index }}"
        }
      }
    ]
  }
}
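
For reference, the pipeline was put in place and is referenced at index time roughly like this (the index and type names below are placeholders, and the exact bulk request shape depends on the indexer being used):

PUT _ingest/pipeline/test
{
  "description" : "Test Logging Pipeline",
  "processors" : [
    { "set" : { "field" : "pipeline", "value" : "test" } }
  ],
  "on_failure" : [
    { "set" : { "field" : "_index", "value" : "failed-{{ _index }}" } }
  ]
}

POST test-logs/log/_bulk?pipeline=test
{ "index" : {} }
{ "message" : "example log line" }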

After enabling the pipeline, the logging stats over a 90-minute window look like this:

  • Test log source: 33.5 million down to 68.5k
  • Other sources: 36.7 million down to 247.5k

We're running with 12 data nodes and 3 ingest nodes. The data nodes are quad core with 16GB heap and the ingest nodes are dual core with 12GB heap. When ingestion through the pipeline began, load averages on the data nodes rose to as much as 7. All of our indexers send their logs via the ingest nodes. The majority are not using ingestion pipelines....

So, I'd like some help understanding the collapse in throughput here.

Regards,
D

All of our indexers send their logs via the ingest nodes

I believe this is your issue.

Ingest nodes work from the bulk/write queue, with one worker per processor (core) pulling from that queue [1]. So when you push all ingest data through only your ingest nodes, you are pushing all ingest-related work through 6 cores (3 nodes * 2 cores). You can think of those 6 cores as the concurrent slots available to handle ingestion. In your case, each slot has to either a) forward index requests to the data nodes and wait for the responses so it can reply to the client with successes/failures, or b) pre-process the data via an ingest pipeline, then forward the index requests and wait for the responses to reply to the client. Anything beyond 6 concurrent requests gets queued. Either way, you have shrunk your "slot" bandwidth for ingestion from 48 (12 nodes * 4 cores) down to 6, AND added extra work on top.
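
One way to see that queueing directly is the _cat thread pool API; on 5.6 the relevant pool is named bulk (it became write in later versions), and a growing queue or a non-zero rejected count on the ingest nodes is the tell-tale sign:

GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected,completed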

In this case (based on the information in the original comment), I would not suggest using dedicated ingest nodes. Rather, spend the additional capacity on more data nodes (1-2 more), make all of the data nodes ingest capable, and ensure that your clients are sending to all of the data nodes. This should help prevent bottlenecks, increase the maximum CPU capacity available, and help cover the additional overhead that ingest pre-processing adds.
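
As a rough sketch (exact values depend on your setup), on 5.x the node roles are controlled in elasticsearch.yml, so a data node that is also ingest capable would look something like:

# elasticsearch.yml - data node that is also ingest capable (5.x settings)
node.master: false
node.data: true
node.ingest: true

With that in place, clients can spread bulk requests across all of the data nodes instead of funnelling everything through the 3 ingest nodes.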

[1] Thread Pool | Elasticsearch Reference [5.6] | Elastic

Thank you @jakelandis, that all makes perfect sense. Does this mean that, putting pipelines to one side, we should "never" ingest via coordinator nodes? I must confess I've found some of the feedback/recommendations in this area contradictory over time. The reason we're ingesting the way we are is based on prior recommendations.

Also, given that our data nodes are quite loaded atm, would enabling ingest node features add much overhead? Assuming the pipeline config isn't complex...

Does this mean that, putting pipelines to one side, we should "never" ingest via coordinator nodes?

For clarity, a coordinator only node (as defined by the docs [1]) does not pre-process the data via ingest pipelines.

Ingestion via coordinator only nodes is just fine; you just need to make sure your coordinator only nodes have enough resources so they don't bottleneck the ingestion. Adding the ingest role to a coordinator only node (making it no longer a coordinator only node) changes the math on how many resources are needed.
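
For reference, a coordinator only node in 5.x is simply one with all three roles disabled in elasticsearch.yml:

# elasticsearch.yml - coordinator only node (5.x settings)
node.master: false
node.data: false
node.ingest: false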

would enabling ingest node features add much overhead? Assuming the pipeline config isn't complex...

The overhead depends on a lot of factors. You can look at our nightly benchmarks (Elasticsearch Benchmarks), which include a grok pipeline run that reduces throughput by ~30% (YMMV). You can also benchmark your own pipelines with your own data in your environment using Rally [2]; the http_logs track is a good example to start from [3].
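
As a lightweight first step (it checks behaviour rather than throughput), you can also push a few sample documents through the pipeline with the simulate API; the document below is just a placeholder:

POST _ingest/pipeline/test/_simulate
{
  "docs" : [
    {
      "_index" : "test-logs",
      "_type" : "log",
      "_id" : "1",
      "_source" : { "message" : "example log line" }
    }
  ]
}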

[1] Node | Elasticsearch Guide [8.11] | Elastic
[2] Quickstart - Rally 2.10.0.dev0 documentation
[3] https://github.com/elastic/rally-tracks/tree/master/http_logs
