No, there is no support for a full cluster restart in the ECK operator. Why do you want to do a full cluster restart? Under normal circumstances there should be no reason to do that. There are, however, exceptional circumstances where it might be necessary to restart individual Elasticsearch nodes, for example if you are running into a bug in Elasticsearch or something of that sort. In such cases you can force-restart a node by simply deleting the corresponding Pod. The operator, or more precisely the StatefulSet controller, will immediately recreate it.
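For example (the cluster name `quickstart` and the Pod name below are illustrative, not taken from this thread), a single node could be force-restarted like this:

```shell
# List the Pods belonging to the Elasticsearch cluster, using the
# standard label the ECK operator applies to them.
kubectl get pods -l elasticsearch.k8s.elastic.co/cluster-name=quickstart

# Delete one Pod; the StatefulSet controller recreates it immediately.
kubectl delete pod quickstart-es-default-1
```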
Hello @pebrc. Thank you for the quick answer. I have this use case:
Using Helm, I must be able to turn security authentication on and off. It works well using:

```yaml
config:
  xpack.security.enabled: false # or true
```
When I deploy the ES cluster with security authentication disabled, the deployment of the ES nodes works well.
When I set the Helm variable to enable authentication, the ES nodes are redeployed and everything is OK and secured. But when I decided to turn security off, only one of the three ES nodes restarted, so the result is that only one node has security turned off. I have to do a manual restart, which works. So for this use case alone I would like to use a different updateStrategy.
I am confused: are you using the ECK operator, or are you using the Elastic Helm chart for Elasticsearch?
If you are using the ECK operator, then turning off `xpack.security.enabled` is not supported. This is a setting that is managed by the operator. You can, however, configure anonymous access if that is required.
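As a sketch of what anonymous access could look like on an ECK-managed cluster (the cluster name, version, and the `monitoring_user` role below are assumptions for illustration, not values from this thread):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.4.0
  nodeSets:
  - name: default
    count: 3
    config:
      # Requests without credentials are authenticated as this user
      # and granted the listed roles.
      xpack.security.authc.anonymous.username: anonymous
      xpack.security.authc.anonymous.roles: monitoring_user
      xpack.security.authc.anonymous.authz_exception: false
```

This keeps `xpack.security.enabled` at its operator-managed default while still allowing unauthenticated requests within the scope of the assigned roles.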
I can confirm that turning security on and off using `xpack.security.enabled` works (except in this case). I did, however, find this line in the ECK operator logs:
```json
{"log.level":"info","@timestamp":"2022-08-24T11:31:19.255Z","log.logger":"elasticsearch-controller","message":"Elasticsearch manifest has warnings. Proceed at your own risk. [spec.nodeSets[0].config.cluster.name: Forbidden: Configuration setting is reserved for internal use. User-configured use is unsupported, spec.nodeSets[0].config.network.host: Forbidden: Configuration setting is reserved for internal use. User-configured use is unsupported, spec.nodeSets[0].config.xpack.security.enabled: Forbidden: Configuration setting is reserved for internal use. User-configured use is unsupported]"...
```
`xpack.security` is a special case that you will find hard, if not impossible, to change, because you would basically take away the nodes' ability to talk to each other during the rolling upgrade (this is why in your experiment only one node was upgraded). Even if you were to force the upgrade through by manually deleting all Pods, you would still end up with all the readiness probes failing and thus no endpoints to talk to. In short, it would be an uphill battle, and we actually want users to keep their clusters secure, hence the default of `true`.
Thank you very much, Peter @pebrc. When manually deleting all Pods, everything works well: the readiness probes did not fail.
In the case of Kibana I had to edit the readinessProbe, and it works out:
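The original probe settings were not included in the comment, but a readinessProbe override on an ECK-managed Kibana generally goes through the `podTemplate`; a hypothetical sketch (all names and probe values below are illustrative assumptions):

```yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: quickstart
spec:
  version: 8.4.0
  count: 1
  elasticsearchRef:
    name: quickstart
  podTemplate:
    spec:
      containers:
      - name: kibana
        # Illustrative override: probe a known endpoint with relaxed timing.
        readinessProbe:
          httpGet:
            path: /login
            port: 5601
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 10
```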
> No, there is no support for a full cluster restart in the ECK operator. Why do you want to do a full cluster restart? Under normal circumstances there should be no reason to do that.
In our operation we do have a reason to do that under normal circumstances: we regularly perform Kubernetes version upgrades on our live clusters, with as little downtime as possible, and now, with the introduction of ECK, we need to do the same on nodes where an ECK cluster is running. We adopt a blue-green approach to upgrading the K8s version in several of our non-ECK services:
1. Cordon the current node pool.
2. Create a second node pool of the same size, with the new K8s version.
3. Use `kubectl set env` (thus setting a K8s rolling update in motion).
4. Monitor until all Pods migrate from the first pool to the second pool.
5. Drain the old pool.
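The procedure above can be sketched as follows (the pool label, deployment name, and environment variable are placeholders, and creating the new node pool is cloud-provider specific):

```shell
# 1. Cordon every node in the current pool (label selector is a placeholder).
kubectl cordon -l cloud.google.com/gke-nodepool=pool-blue

# 2. Create a second node pool with the new K8s version
#    (provider-specific, e.g. gcloud / eksctl / az -- not shown here).

# 3. Trigger a rolling update by setting a dummy environment variable.
kubectl set env deployment/my-service RESTARTED_AT="$(date)"

# 4. Watch workloads move onto the new pool.
kubectl get pods -o wide --watch

# 5. Drain the old pool once everything has migrated.
kubectl drain -l cloud.google.com/gke-nodepool=pool-blue \
  --ignore-daemonsets --delete-emptydir-data
```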
Since step 3 is not an option with an ECK cluster, what should we do instead?
- The ECK StatefulSet has an `updateStrategy` of `OnDelete`, so we cannot trigger an automatic K8s rolling update.
- Our Elasticsearch manifest is not changing at all (only the K8s version is changing at node level), so `kubectl apply` will not "wake up" the ECK operator into managing a traditional Elasticsearch rolling restart.
- Simultaneously deleting all Pods raises concerns within our team about potential downtime in our search services while the deleted Pods are being recreated (and, likewise, about the Elasticsearch cluster health going yellow or red).
Our team would have more peace of mind if the ECK operator could offer a built-in command to perform a fully managed eviction/restart of its StatefulSet: the same operation triggered when we do `kubectl apply` with a changed Elasticsearch manifest, but in this case invoked arbitrarily, with no changes whatsoever. This would match our current blue-green procedure for upgrading the Kubernetes version on the nodes of all our other services.
@pebrc Don't you think this is a legitimate use case? Or are you aware of other techniques to perform blue-green Kubernetes upgrades on ECK clusters with minimal downtime?
@arboliveira I believe your use case is absolutely legitimate. But I also think there may be some misunderstanding regarding the terminology going on here.
When we say "full cluster restart" in Elasticsearch, we mean basically a full shutdown of the cluster followed by a restart of all Elasticsearch nodes. This happens "in place", with no moving to other infrastructure, which is quite hard to achieve on Kubernetes anyway and not what you are after.
What you can do today is use `kubectl drain`, or, if you want a bit more fine-grained control, the Eviction API. This will respect the Pod Disruption Budget (PDB) the operator sets for Elasticsearch clusters and effectively limit evictions to one Pod at a time. Under the hood, Pods are still simply deleted, and with that you would be relying on Elasticsearch's built-in recovery and resilience through index replicas to keep the cluster green, or at least yellow. An operator-orchestrated move of the Pods could do more to communicate to Elasticsearch that a Pod is about to go away. The eviction mechanism should still minimise the risk of losing availability compared to the approach you mentioned, where Pods are just manually deleted one at a time, because the PDB is taken into account, and the operator adjusts that PDB based on cluster health.
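For example (the node name is a placeholder), a drain that respects the operator-managed PDB might look like:

```shell
# Evict all Pods from a node; evictions blocked by the Elasticsearch
# PDB are retried until the cluster can spare the Pod.
kubectl drain my-node-1 --ignore-daemonsets --delete-emptydir-data

# Optionally watch the Elasticsearch resource's health while Pods move
# (the ECK operator reports it in the resource status).
kubectl get elasticsearch
```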