We have 37 data nodes but our clients only use 6 of them to send requests to.
Is there any reason why the entire set of nodes couldn't be used by the clients? I know that some clusters use dedicated coordinator nodes, but I don't see a justification for that in our system given the sheer number of nodes we have since each can serve as a coordinator AND our generally low CPU usage. Is there any reason we couldn't just say that any one of the 37 nodes is a valid coordinator node to use for indexing/querying?
By using all data nodes as coordinator nodes, I see it as reducing the coordination load by 83% on each of the 6 nodes that do it now, while adding just a marginal overhead to each of the remaining nodes (if these 6 nodes can do it and still handle their data responsibilities, then the others should be able to handle 1/6 of what the current coordinators are doing). It also seems like it's more resilient. Right now if one of the 6 goes down we've lost 16.6% of our coordination ability, vs 2.7% if one of the 37 goes down.
Seems like a no brainer, which means I must be missing something.
If you have dedicated master nodes, you generally want to leave these out and not send requests to them. If you have a tiered architecture, e.g. hot-warm-cold, it may also make sense to direct requests only to certain nodes. If the cluster however is homogenous it is generally better to distribute the load as evenly as possible.
I see. Does your clients know the nodes where their data resides for a search request? Do they send search request with node_ids set in preference field?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.