I'm currently working on some automation for running Elasticsearch in a clustered fashion on top of Kubernetes, and would love to be able to manually trigger a master re-election (or, alternatively, disallow the current master from remaining master, similar to setting cluster.routing.allocation.exclude for shard allocation). Right now, upon a scale-down event involving the master node, the cluster can turn red for up to 30s (and thus serve no requests).
This downtime can, and really should, be avoided, and so far there seems to be no graceful way to do so. This is a request/thread to open discussion on how this could be implemented, coming out of the GitHub issue here: https://github.com/elastic/elasticsearch/issues/17493
I have attempted to use DNS/ping-based discovery, but ran into separate issues with that (unrelated to this particular request).
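For reference, the shard-allocation analogue I'm referring to looks roughly like this. This is only a minimal sketch using the Python elasticsearch client; the endpoint and the node name "es-data-2" are placeholders for my own setup:

```
# Minimal sketch, assuming the official elasticsearch-py client; the endpoint
# and the node name "es-data-2" are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Existing mechanism for data nodes: drain shards off a node before it is
# scaled down, via cluster.routing.allocation.exclude.
es.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.exclude._name": "es-data-2"
    }
})
```

Something equivalent for the master role ("this node must not be, or stay, master") is essentially what I'm after.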
Some of our users would find any period of downtime like this unacceptable, so I feel that even reducing this time is not a true solution to this feature request.
While we do have plans to speed up the 3s for the case where the master has left and all nodes respond promptly, these are not coming soon (it's non-trivial, to say the least). That said, I think we should clarify what those 3s mean: by default, searches and gets will be served fine. Indexing operations will wait until a new master is elected and then proceed as before; no request should be rejected. Are you seeing something else?
PS - you should find out why election takes 30s - it's indicative of something else that's wrong.
I've not done particularly thorough testing of what does and does not work during this period. I have been using Kibana to monitor my cluster, and it goes all red during this time, hence I assumed that the majority of cluster operations were not functional.
I'll run some tests now to determine exactly how long the cluster is unavailable for, and what exactly is unavailable, and get back to you. Is the re-election timeout configurable from 3s? (I ask so I can check whether mine has been set to anything other than 3s!)
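The kind of probe I plan to run is along these lines (a rough sketch with the Python client, not an official tool; the endpoint, the index name "probe", and the 5s client timeout are my own choices), killing the master pod while it loops:

```
# Rough probe: issue a search and an index request once a second and log
# which calls fail, while the master pod is killed.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"], timeout=5)

while True:
    ts = time.time()
    try:
        es.search(index="probe", body={"query": {"match_all": {}}})
        print(ts, "search ok")
    except Exception as err:
        print(ts, "search failed:", err)
    try:
        es.index(index="probe", doc_type="doc", body={"ts": ts})
        print(ts, "index ok")
    except Exception as err:
        print(ts, "index failed:", err)
    time.sleep(1)
```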
> I've not done particularly thorough testing of what does and does not work during this period; I have been using Kibana to monitor my cluster
Yes, losing a master does make the cluster go red during election. It's not a "lite" event ...
> I assumed that the majority of cluster operations were not functional.
All operations should either be served or wait for a new master to be elected, timing out after a reasonable period (30s for master-level operations like creating an index, 60s for indexing).
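Those timeouts are also exposed per request. Roughly, with the Python client (the index name "my-index" is made up):

```
# Sketch of how the two timeouts surface per request.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Master-level operation: waits up to master_timeout (default 30s) for a
# master before failing.
es.indices.create(index="my-index", master_timeout="30s")

# Indexing: blocks for up to `timeout` (default 1m) waiting for the cluster
# to be ready, rather than being rejected immediately.
es.index(index="my-index", doc_type="doc", body={"field": "value"}, timeout="60s")
```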
> Is the re-election timeout configurable from 3s?
The setting is discovery.zen.ping_timeout. See here.
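If you want to check what your nodes are actually running with, one way (a sketch with the Python client; endpoint assumed) is to pull each node's settings:

```
# Sketch: read each node's settings and print discovery.zen.ping_timeout,
# falling back to the 3s default when it is not set explicitly.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

info = es.nodes.info(metric="settings")
for node_id, node in info["nodes"].items():
    zen = node["settings"].get("discovery", {}).get("zen", {})
    print(node["name"], zen.get("ping_timeout", "3s (default)"))
```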