Hypothetical - ingest nodes vs data nodes

I got into a design discussion:

Is it better to send bulk requests to ingest nodes or directly to the data (hot) nodes? One thing that I don’t think I’ve ever seen discussed is this: say we’re ingesting to data nodes. A bulk request will likely contain events for shards that are not on that particular node; maybe none of the target shards are on it. So after processing, that node must forward the events to the nodes owning the active shards. On top of that there is replication traffic, and we don’t know whether the “ingesting” node sends to all replicas, or only to the node holding the primary, which must then forward the data on to the replicas.
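To make the scatter effect concrete, here’s a toy sketch; a generic hash stands in for Elasticsearch’s actual murmur3-based routing, and the shard count is made up:

```python
import hashlib

NUM_PRIMARY_SHARDS = 3  # hypothetical index setting

def shard_for(doc_id: str) -> int:
    # Elasticsearch really hashes the routing value with murmur3;
    # any stable hash is enough to show the scatter effect.
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NUM_PRIMARY_SHARDS

for i in range(10):
    doc_id = f"event-{i}"
    print(doc_id, "-> shard", shard_for(doc_id))
# With 3 primaries spread over 3 data nodes, roughly a third of a
# bulk request stays "local" to the receiving node; the rest must
# be forwarded over the network.
```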

If an ingest node is used, we know that none of the data lives there, so everything must be forwarded along, but the network load on the data nodes would be lower.

But, we decided that we didn’t know if this was significant enough to care about :slight_smile:

Are you using ingest pipelines?

If you do, I think it's better to send to ingest nodes so the ingest node can run the pipeline directly and then forward the documents to the data nodes holding the target shards.
If not, then sending to a data node might in some cases avoid a bit of network traffic, if you are lucky enough to have some of the target shards assigned to that node.
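For example, a minimal sketch with the Python client, assuming dedicated ingest nodes at the hypothetical hosts `ingest-1`/`ingest-2`; the pipeline id, processor, and index name are all made up:

```python
from elasticsearch import Elasticsearch, helpers

# Point the client at the ingest tier so the pipeline runs there
# before documents are forwarded to the data nodes.
es = Elasticsearch(["http://ingest-1:9200", "http://ingest-2:9200"])

# A tiny pipeline that just tags every event.
es.ingest.put_pipeline(
    id="my-logs-pipeline",
    description="Tag events on ingest",
    processors=[{"set": {"field": "env", "value": "prod"}}],
)

actions = (
    {"_index": "logs-demo", "_source": {"message": f"event {i}"}}
    for i in range(100)
)
# pipeline= applies the ingest pipeline to every document in the bulk.
helpers.bulk(es, actions, pipeline="my-logs-pipeline")
```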

This is from a very old talk I gave, back in 2015 IIRC.

Another solution is to have client nodes + ingest nodes + data nodes. The client node receives the traffic, passes it to the ingest nodes, and from there it goes to the data nodes.
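From the application side that layout might look like the sketch below; the host names are hypothetical, and the comments show one way the tiers could be declared with `node.roles` in elasticsearch.yml:

```python
from elasticsearch import Elasticsearch

# Hypothetical three-tier layout:
#   client (coordinating-only) nodes:  node.roles: []
#   ingest tier:                       node.roles: [ ingest ]
#   hot data tier:                     node.roles: [ data_hot, data_content ]
# The application only ever talks to the client tier, which fans
# requests out to the ingest and data nodes as needed.
es = Elasticsearch(["http://coord-1:9200", "http://coord-2:9200"])

es.index(index="logs-demo", document={"message": "hello"})
es.search(index="logs-demo", query={"match_all": {}})
```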

Not sure it answers your question... :slight_smile:

As far as I know, it is the node holding the primary shard that forwards requests to the replica shards.

If you are using ingest pipelines, I would say it depends more on the load these pipelines cause than on the traffic between nodes. The hot nodes already do a lot of work, and if they are often under heavy load I would go with dedicated ingest nodes. For more lightly loaded deployments I would consider dedicated ingest nodes recommended but optional.

If you are not using ingest pipelines, I generally do not see any need for client nodes in front of the hot nodes for indexing. For querying, client nodes can take some load off the data nodes, so they may be a good option in some cases. Here I generally recommend running benchmarks to see whether it is worth it.
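As a quick sanity check, a crude probe like the one below can compare the same query against the two tiers; the host names and the `logs-demo` index are assumptions, and for serious testing Elastic's benchmarking tool Rally is the better choice:

```python
import time
from elasticsearch import Elasticsearch

def median_search_ms(es: Elasticsearch, n: int = 50) -> float:
    # Time n identical searches and return the median latency in ms.
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        es.search(index="logs-demo", query={"match_all": {}}, size=0)
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return samples[len(samples) // 2]

# Same workload, once against a hot node and once against a client node.
for label, hosts in [("data", ["http://hot-1:9200"]),
                     ("client", ["http://coord-1:9200"])]:
    print(label, f"{median_search_ms(Elasticsearch(hosts)):.1f} ms median")
```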

Thanks to both, that adds good info. (Sorry I can’t mark both as “Solution”.)

We have never been big enough to need ingest nodes.

When we were self-hosted, on what was roughly a 5.x design, we were heavy Logstash users; I don’t think ingest pipelines existed yet. Our physical servers had excess CPU and RAM, so I ran a Logstash instance on each hot node (yes, sacrilege now). Beats sent round-robin to those nodes, and each Logstash forwarded on to its “local” hot node. These were rolling-upgraded through 7 and 8, but we mostly stayed with Logstash and the equivalent “integration” pipelines that were available.

We hosted Kibana on 2 client nodes. Because they were the DNS entry point for our “Elastic” service, anyone using scripts would also pass through those nodes. It helped keep bad things users might do from hurting our data nodes :slight_smile:

Now we’re on cloud, and Agents send to the hot nodes. We use a lot of the pipelines from the provided integrations, but only a few very simple ones that we developed ourselves. We completely eliminated Logstash. There are a few syslog-type things that we host on local VMs as necessary for network access, but otherwise we use Agent integrations.

It’s easier for me to pass knowledge on one method (Agent) than to try to explain why we would use Agent, Beats, and/or Logstash for various streams.