So this may sound a bit odd, but try to follow along with me.
I was trying to be clever and used DNS round-robining to simplify the deployment of Elasticsearch nodes. For those of you unfamiliar, this is achieved by creating multiple A records with the same name, each pointing to a different node's IP address (see the attached picture for an example: https://cl.ly/2n3T0J1l192b).
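To illustrate, a round-robin setup like that would look roughly like the following in a BIND-style zone file (the hostname and IP addresses here are just placeholders):

```
; hypothetical round-robin records: one name, multiple A records
esnodes.domain.com.   300   IN   A   10.0.0.11
esnodes.domain.com.   300   IN   A   10.0.0.12
esnodes.domain.com.   300   IN   A   10.0.0.13
```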
This simplified deployment by allowing the elasticsearch.yml config files to contain something like discovery.zen.ping.unicast.hosts: ["esnodes.domain.com"]. When a new node came up, it would use the DNS round-robin to locate the cluster and join, with no list of IP addresses needed.
While this felt like the golden ticket to maintainability, and over the last 6 months it has gone well, I noticed that our Elasticsearch logs sometimes get into a fit of odd network activity: nodes dropping out due to ping timeouts, and other network-related problems.
After looking through all the possible options, I replaced the elasticsearch.yml unicast hosts with hardcoded IP strings. All of the problems have stopped, and our logs are looking better than ever.
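In other words, the discovery setting in each elasticsearch.yml now looks roughly like this (the IP addresses below are hypothetical):

```yaml
# node IPs listed explicitly instead of a single round-robin hostname
discovery.zen.ping.unicast.hosts: ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
```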
I am leaving this post for two reasons:
1. Does anyone have a more detailed explanation of why this works/breaks? I'm just looking for more information.
2. In case anyone ever stumbles on this post: don't make the same mistakes I made.
I made a similar attempt a couple of years ago, and for the same reason (to allow me to use the same elasticsearch.yml file on all nodes), but I ran into the same issues. I believe my problem was due to the round-robin from time to time hitting a struggling node, which failed to return cluster state / master info within the timeout limit and thus caused the node to fail to join the cluster.
So for me, round-robin was not a good solution. What I've done instead is define three dedicated master-eligible candidates (with node.data: false) for the cluster, allowing me to use the same setting in the elasticsearch.yml files on all nodes in the cluster.
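Roughly speaking, the setup looks like this (the master hostnames below are placeholders for your own):

```yaml
# on the three dedicated master candidates only
node.master: true
node.data: false

# in the elasticsearch.yml of every node in the cluster
discovery.zen.ping.unicast.hosts: ["esmaster1.domain.com", "esmaster2.domain.com", "esmaster3.domain.com"]
discovery.zen.minimum_master_nodes: 2
```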
This has worked much better, since at least two of the master candidates must be up for the cluster to function anyway (with discovery.zen.minimum_master_nodes: 2), and thus new nodes should always be able to join a healthy cluster even if one or more data nodes are down or struggling.
Thanks for the reply; I am glad to hear that I am not the only one who has had this problem. I agree with your theory:
"I believe my problem was due to the round-robin from time to time hitting a struggling node, which failed to return cluster state / master info within the timeout limit and thus caused the node to fail to join the cluster."
I found several instances in the logs of exactly this happening, which is why I came to the same conclusion that round-robining was the cause. Since switching away from RR, it has worked perfectly, without a single timeout in the logs.