ECK Elasticsearch - Performing Full Restart of ES Nodes

Hello hello,
I would like to ask whether it is possible to perform a full restart of ES nodes using the ECK operator.
Is it possible to do using these settings?

      maxSurge: <SOME VALUE>
      maxUnavailable: <SOME VALUE>
      minAvailable: <SOME VALUE>
      {{ .Values.elasticsearch.reference }}

Could you please help me set this correctly?

No, there is no support for a full cluster restart in the ECK operator. Why do you want to do a full cluster restart? Under normal circumstances, there should be no reason to do that. There are, however, exceptional circumstances where it might be necessary to restart individual Elasticsearch nodes, for example if you are running into a bug in Elasticsearch or something of that sort. In such cases you can force-restart a node by simply deleting the corresponding Pod. The operator, or more precisely the StatefulSet controller, will immediately recreate it.
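The pod deletion described above can be sketched with kubectl; the Pod and namespace names below are illustrative placeholders, not from this thread:

```shell
# Force-restart a single Elasticsearch node by deleting its Pod.
# The StatefulSet controller recreates it immediately.
kubectl delete pod quickstart-es-default-0 -n elastic-system
```

Assuming the default persistent storage configuration, the data volume survives because ECK provisions Elasticsearch data on PersistentVolumeClaims, so the recreated Pod rejoins the cluster with its shards intact.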


Hello @pebrc. Thank you for the quick answer. I have this use case:
Using Helm, I must be able to turn security authentication on and off. It works well using

config:
  xpack.security.enabled: false # or true

When I deploy the ES cluster with no security authentication, the deployment of the ES nodes works well.
When I set the Helm variable to enable authentication, the ES nodes are redeployed and everything is OK and secured. But when I decide to turn security off again, only one of the three ES nodes restarts, so the result is that only one node has security turned off. I have to do a manual restart, which works. So only for this use case would I like to use a different updateStrategy.

Or do you have a better solution?

I am confused: are you using the ECK operator, or are you using the Elastic Helm chart for Elasticsearch?

If you are using the ECK operator, then turning security off is not supported; this setting is managed by the operator. You can, however, configure anonymous access if that is required.
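If anonymous access is the route taken, it can be configured through regular (non-reserved) Elasticsearch settings in the manifest. A minimal sketch, in which the cluster name, version, and role are illustrative assumptions:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart            # example name
spec:
  version: 8.4.0              # example version
  nodeSets:
    - name: default
      count: 3
      config:
        # Unauthenticated requests are mapped to this username and role;
        # pick a suitably restricted role in practice.
        xpack.security.authc.anonymous.username: anonymous_user
        xpack.security.authc.anonymous.roles: monitoring_user
```

This keeps security enabled (so the operator and inter-node TLS keep working) while allowing unauthenticated requests within the bounds of the assigned role.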

I am using the ECK operator. I have my own Helm chart in which I have specified the Elasticsearch kind manifest YAML. Sorry for the inaccurate description.

kind: Elasticsearch

I can confirm that turning security on/off using this setting works (except in this case :slight_smile:). However, I found this line in the ECK operator logs:

{"log.level":"info","@timestamp":"2022-08-24T11:31:19.255Z","log.logger":"elasticsearch-controller","message":"Elasticsearch manifest has warnings. Proceed at your own risk. [spec.nodeSets[0] Forbidden: Configuration setting is reserved for internal use. User-configured use is unsupported, spec.nodeSets[0] Forbidden: Configuration setting is reserved for internal use. User-configured use is unsupported, spec.nodeSets[0] Forbidden: Configuration setting is reserved for internal use. User-configured use is unsupported]"...

I continued with this setting because I found this comment: ECK - Xpack settings - Forbidden: Configuration setting is reserved for internal use - #4 by michael.morello.

Okay, I will try anonymous access.

You are right: we only log warnings if you use one of the reserved settings (Settings managed by ECK | Elastic Cloud on Kubernetes [2.4] | Elastic) and do not block you, because there might be exceptional cases where you need to use one of them. xpack.security.enabled is a special case, though, where you will find it hard if not impossible to change the value, because you basically take away the ability of the nodes to talk to each other during the rolling upgrade (this is why in your experiment only one node was upgraded). Even if you were to force the upgrade through by manually deleting all Pods, you would still end up with all the readiness probes failing and thus no endpoints to talk to. In short, it would be an uphill battle, and we actually want users to keep their clusters secure, hence the default of true.


Thank you very much, Peter @pebrc. In the case of manually deleting all Pods, everything works well; the readiness probes didn't fail.
:bulb: In the case of Kibana, I had to edit the readinessProbe, and it works:

        - name: kibana
          image: "{{ .Values.acr.url }}/kibana/kibana:{{ .Values.kibana.version }}"
          readinessProbe:
            httpGet:
              {{- if eq true }}
              path: /login
              port: 5601
              scheme: HTTPS
              {{- else }}
              path: /
              port: 5601
              scheme: HTTPS
              {{- end }}

As this will be a fairly isolated case, the manual solution of deleting Pods will be sufficient.

No, there is no support for a full cluster restart in the ECK operator. Why do you want to do a full cluster restart? Under normal circumstances, there should be no reason to do that.

In our operation we do have a reason to do that under normal circumstances: we regularly perform Kubernetes version upgrades on our live clusters, with as little downtime as possible, and now with the introduction of ECK we need to do the same on nodes where an ECK cluster is running. We adopt a blue-green approach to upgrade the version of K8s in several of our non-ECK services:

  1. Cordon the current node pool
  2. Create a second node pool of the same size, with new K8s version
  3. Use kubectl set env (thus setting in motion a K8s rolling update)
  4. Monitor until all nodes migrate from the first pool to the second pool
  5. Drain the old pool.
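The five steps above can be sketched with kubectl; the pool label and deployment name are illustrative assumptions, and node-pool creation itself happens on the cloud-provider side:

```shell
# 1. Cordon the current node pool so nothing new schedules there.
kubectl cordon -l agentpool=old-pool

# 2. Create a second node pool of the same size with the new K8s version
#    (done via the cloud provider's CLI/console, not kubectl).

# 3. Trigger a rolling update by touching the Pod template, e.g. a dummy env var.
kubectl set env deployment/my-service RESTARTED_AT="manual-$(date +%s)"

# 4. Watch Pods migrate from the old pool to the new one.
kubectl get pods -o wide --watch

# 5. Drain the old pool once all workloads have moved.
kubectl drain -l agentpool=old-pool --ignore-daemonsets --delete-emptydir-data
```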

Since step 3 is not an option with an ECK cluster, what should we do instead?

  • The ECK StatefulSet has an updateStrategy of OnDelete, so we cannot trigger an automatic K8s rolling update.
  • Our Elasticsearch manifest is not changing at all (only the K8s version changes at node level); therefore, kubectl apply will not "wake up" the ECK operator into managing a traditional Elasticsearch rolling restart.
  • Simultaneously deleting all Pods raises concerns within our team about potential downtime in our search services while the deleted Pods are recreated (and likewise about the Elasticsearch cluster health going yellow or red).

Our team would have more peace of mind if the ECK operator offered a built-in command to perform a fully managed eviction/restart of its StatefulSets - the same operation triggered when we kubectl apply a changed Elasticsearch manifest, but in this case arbitrarily, with no changes whatsoever. This would match our current blue-green procedure for upgrading the Kubernetes version on the nodes of all our other services.

@pebrc Don't you think this would be justifiable as a legitimate use case? Or would you be aware of other techniques to perform blue-green Kubernetes upgrades in ECK clusters with minimal downtime?

@arboliveira I believe your use case is absolutely legitimate. But I also think there is maybe some misunderstanding regarding the terminology going on here.

When we say "full cluster restart" in Elasticsearch, we mean basically a full shutdown of the cluster and then a restart of all Elasticsearch nodes. This happens "in place", so no moving to other infrastructure, which is something quite hard to achieve on Kubernetes anyway, and not what you are after.

What you are trying to achieve is more of an orchestrated re-scheduling of Elasticsearch Pods. We have been thinking about this; this issue touches on the problem, for example: Better Documentation for Performing Manual Rolling Restarts (of underlying hosts) of Persistent Clusters · Issue #5305 · elastic/cloud-on-k8s · GitHub. However, there is no support for this in the operator yet.

What you can do today is use kubectl drain, or, if you want a bit more fine-grained control, the Eviction API. This will respect the Pod Disruption Budget the operator sets for Elasticsearch clusters and effectively limit evictions to one Pod at a time. Under the hood, Pods are still simply deleted, so you would be relying on Elasticsearch's built-in recovery and resilience through index replicas to keep the cluster green, or at least yellow. An operator-orchestrated move of the Pods could do a bit more to communicate to Elasticsearch the fact that a Pod is about to go away. Still, the eviction mechanism should minimise the risk of losing availability compared to the approach you mentioned, where Pods are just manually deleted one at a time, because the PDB, which the operator adjusts based on cluster health, is taken into account.
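Draining one node at a time as described can look like the following; the node name is an illustrative placeholder:

```shell
# Evict Pods from one node. Evictions respect the PodDisruptionBudget that ECK
# maintains, so at most one Elasticsearch Pod is disrupted while the cluster
# is healthy; if the PDB would be violated, the eviction is refused and retried.
kubectl drain old-node-1 --ignore-daemonsets --delete-emptydir-data

# Wait for the cluster to report green health again before draining the next node.
kubectl get elasticsearch -A
```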

Hope that helps!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.