I've read that the best practice for querying/searching is to use a coordinating node, which makes sense to me (a node that isn't busy with disk operations for indexing and can dedicate its memory to the gather phase of searches).
However, when indexing data the best practice is to write directly to data nodes. As far as I know, when a write/index request is issued to a data node, that node routes the request to the node holding the relevant primary shard (not an HTTP redirect; the TCP connection stays open to the original node, which forwards the data to the primary shard holder).
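For reference, here is roughly how I understand that routing step, as a sketch only: the documented formula is shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the doc id and the hash is Murmur3. The md5 below is just a stand-in to illustrate the idea.

```python
# Rough illustration of how Elasticsearch picks a primary shard for a document.
# The real implementation hashes the _routing value (doc id by default) with
# Murmur3; md5 here is only a stand-in for demonstration.
import hashlib

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Return the primary shard number a document would land on."""
    digest = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return digest % num_primary_shards

# Whichever node receives the index request computes this and forwards the
# document to the node holding that primary shard.
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, num_primary_shards=5))
```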
In that case, why not use a coordinating node for indexing as well? The larger my cluster is, the higher the chance that I won't hit the node holding the primary shard, and in that case the data node will be busy forwarding all of that data instead of doing other work. Why not create a set of dedicated nodes (maybe even behind a load balancer) that forward the traffic to the right data node?
If I got the explanation completely wrong, could you please explain the motivation (at a low technical level) for using a coordinating node for searches only?
I'm curious where this idea has come from. Indexing can indeed also benefit from going via a coordinating node, for exactly the reasons you describe. You might want to use dedicated ingest nodes if you are running ingest pipelines, but even if you're not, dedicating some nodes to coordinating indexing can make sense.
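To make it concrete, here is a minimal sketch of what that could look like with the Python client. The hostnames and index name are made up, and it assumes dedicated coordinating-only nodes (node.roles: [] in elasticsearch.yml) sitting in front of the data nodes.

```python
# Minimal sketch: send bulk indexing traffic to dedicated coordinating-only
# nodes instead of data nodes, using the elasticsearch-py client.
# Hostnames and index name below are hypothetical.
from elasticsearch import Elasticsearch, helpers

# Coordinating-only nodes act as the entry point; they split each bulk
# request and forward items to the data nodes holding the target primary shards.
es = Elasticsearch(
    ["http://coord-1:9200", "http://coord-2:9200"]  # client round-robins across these
)

actions = (
    {"_index": "logs-2024.01.01", "_source": {"message": f"event {i}"}}
    for i in range(1000)
)
helpers.bulk(es, actions)
```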
Hmm. Both from experienced Elasticsearch people. I wonder if I'm missing something. @thiago or @Christian_Dahlqvist can you explain your thinking there?
Ingest is hopefully in the form of a bulk request, i.e. several docs to be indexed at once. Whatever node gets this request has to decide where to send each item. We know that shards are assigned by modulo math on the doc id. I've wondered whether doc ids are generated to spread events across shards within each bulk request (even if just random doc ids are generated), or whether they are generated so that everything in one bulk request lands on the same shard (maybe each bulk request becomes a segment).
In a multi-node cluster with a multi-shard index, it seems that no matter which node receives the bulk request, some events will have to scatter to the other nodes holding the index's shards. An ingest-only node would "never" receive an event it can store locally, while a data node might statistically approach storing about 1/number-of-data-nodes of the events locally.
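A quick back-of-the-envelope simulation of that point (all numbers invented): with ids spread evenly across shards, the receiving node can only keep roughly its own share of the bulk request locally.

```python
# Simulate what fraction of a bulk request the receiving data node can store
# locally when doc ids spread evenly across shards. Numbers are illustrative.
import random

num_data_nodes = 5
num_primary_shards = 10                       # assume 2 primary shards per data node
shards_per_node = num_primary_shards // num_data_nodes
bulk_size = 10_000

receiving_node_shards = set(range(shards_per_node))  # shards held by the receiving node
local = sum(
    1
    for _ in range(bulk_size)
    if random.randrange(num_primary_shards) in receiving_node_shards
)
print(f"{local / bulk_size:.0%} of bulk items stored locally, rest forwarded")
# Prints roughly 20%, i.e. about 1 / num_data_nodes.
```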
I think we have a pretty robust stack for our needs by using Beats and Logstash with "load balancing" on the Elasticsearch output, listing all of our hot nodes.
Coordinating-only nodes can certainly be useful for indexing as well as querying, but I would generally expect that to be the case for larger clusters. For small clusters I think dedicated coordinating nodes often add relatively little value, and an additional data node may have a bigger impact. Dedicated ingest nodes may however make sense at a much earlier point.
Thank you for the clarification! A few small questions, if I may -
What is considered a "large cluster"? Do you have a rule of thumb for this scenario? My cluster currently indexes 5TB per day; would you consider that large?
Would you recommend ingest nodes even if they aren't doing any processing or filtering (in my case that is done by various tools before writing to Elasticsearch)? Is that any different from a coordinating node, given that ingest nodes act as coordinators as well?
Yes, I would consider that large. I assume you have a reasonably large number of nodes in the cluster.
Whether you benefit from dedicated coordinating-only nodes will depend on what is limiting performance in your cluster. It will move some processing off the data nodes and reduce their network traffic somewhat, but it is hard to estimate the effect without testing it.
I recall seeing issues when an extremely large number of coordinating-only nodes was added, but if you keep it to a reasonably small number I do not see any issues. The more nodes you have, the more work is required to propagate cluster state changes. How much impact this has will depend on how frequently your cluster state updates.