My cluster currently involves:
Machine 1 : ES - co-ordinating node + Kibana
Machine 2 : ES - master + data + Logstash
Machine 3 : ES - master + data + Logstash
Machine 4 : ES - master + data + Logstash
Sharding : 1 primary per index, 1 replica, Logstash creates monthly indices.
The 3 Logstash instances (Machine 2, Machine 3, Machine 4) are set to pull from Kafka - which nodes should I set the ES output to?
I have come across articles recommending that all the data-eligible ES nodes be added to this list.
My question is:
With 1 primary shard per index, what happens when a document is sent to the node containing the replica shard for that index?
What happens when I add another data-only ES + Logstash node to the cluster? What happens when I add another master + data ES + Logstash node? Do I include or exclude these nodes from all Logstash outputs?
Would it be better to send all Logstash outputs to the co-ordinating node instead?
Data nodes are the way to go. If the primary shard is not on the node that the client sends the document to, the request will be rerouted internally to the node that holds the primary.
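To make the rerouting concrete, here is a minimal toy model in Python. It is only a sketch, not how Elasticsearch is implemented: the node names, the `primary_location` map and the `zlib.crc32` hash are all assumptions chosen for illustration (Elasticsearch uses its own hash of the routing value). With one primary shard per index, every document maps to shard 0, so the only question is whether the receiving node happens to hold that primary.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # Stand-in hash for illustration only; not Elasticsearch's real hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def index_document(receiving_node, doc_id, primary_location, num_primary_shards=1):
    # Work out which shard owns the document, then check whether the node
    # that received the request also holds that shard's primary copy.
    shard = shard_for(doc_id, num_primary_shards)
    primary_node = primary_location[shard]
    return {
        "shard": shard,
        "primary_node": primary_node,
        "forwarded": primary_node != receiving_node,
    }


# Hypothetical layout: the primary of shard 0 lives on machine-2,
# its replica on machine-3.
primary_location = {0: "machine-2"}

sent_to_replica_node = index_document("machine-3", "doc-1", primary_location)
sent_to_primary_node = index_document("machine-2", "doc-1", primary_location)

print(sent_to_replica_node)
print(sent_to_primary_node)
```

In this model, a document sent to the node holding only the replica is simply forwarded once to the primary's node, and the client never needs to know where the primary is.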
One of the ideas here is that you do not need to worry about topology. As a user, you want a URL (or a list of URLs) to connect to, but you don't care whether your cluster has three nodes or a hundred.
So, why data nodes instead of coordinating nodes? If you hit a data node, there is a chance that the primary shard is local, so no forwarding is needed, whereas the coordinating node will always have to forward. Also, in this setup you would send all your data to a single coordinating node instead of spreading the load across several data nodes.
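The difference can be put into rough numbers with a small simulation. This is a sketch under stated assumptions: the client is assumed to rotate evenly (round-robin) through the hosts it is given, the single primary shard is assumed to sit on a hypothetical `machine-2`, and the function `count_forwards` is made up for this example.

```python
from itertools import cycle


def count_forwards(hosts, primary_node, num_docs):
    # Rotate through the configured hosts and count how many requests
    # land on a node that does not hold the primary shard.
    rotation = cycle(hosts)
    forwards = 0
    for _ in range(num_docs):
        if next(rotation) != primary_node:
            forwards += 1
    return forwards


data_nodes = ["machine-2", "machine-3", "machine-4"]  # hypothetical names
coordinating_only = ["machine-1"]

via_data_nodes = count_forwards(data_nodes, "machine-2", 300)
via_coordinating = count_forwards(coordinating_only, "machine-2", 300)

print(via_data_nodes)    # 200: one request in three finds the primary locally
print(via_coordinating)  # 300: every request has to be forwarded
```

Under these assumptions the three data nodes each receive 100 of the 300 requests, while the coordinating node would receive all 300 on its own and forward every one of them.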