Updating xpack certificates to a new CA in an active Elasticsearch cluster without downtime

Hi,

We have an active Elasticsearch cluster (large) with TLS/SSL enabled for inter-node communication (xpack.security.transport.ssl). We are planning to replace the existing node certificates with new ones issued by a different CA.

We want to make sure this process is safe and minimally disruptive. Our questions are:

  1. Can nodes with certificates from a new issuer join and operate in the cluster alongside nodes with the old certificates during the rotation?

  2. What is the recommended procedure for rolling out new certificates in a production cluster without causing downtime or cluster instability?

  3. Are there common pitfalls or best practices for this type of certificate rotation?

Any guidance, references, or examples from people who have performed a similar upgrade in production would be greatly appreciated.

Thanks in advance!

Interesting question. I always suggest doing this first on a test cluster with the same versions, using VMs say, to establish and validate the upgrade process.

It’d help to know the Elasticsearch version and to see the full elasticsearch.yml - some oddities that might be significant could be hiding in plain sight there.

In general this should be doable without downtime, but it will require rolling restarts (likely two). The key is to establish trust for the new CA (in addition to the existing CA) everywhere first, and only then rotate to the new certificates. At least that's what I'd do. If you have done version upgrades before, it should be similar in terms of "disruption".
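As a sketch of that "trust both CAs" state, assuming the CA certificates are available as PEM files (the paths below are placeholders, not anything Elasticsearch mandates), the transport layer can be told to trust both CAs in elasticsearch.yml. Note that as far as I recall certificate_authorities and truststore.path are mutually exclusive, so this variant only applies if you configure trust via PEM files:

```yaml
# elasticsearch.yml on every node - intermediate state, trust BOTH CAs.
# Paths are placeholders for wherever your PEM CA certificates live.
xpack.security.transport.ssl.certificate_authorities:
  - /etc/elasticsearch/certs/old-ca.crt
  - /etc/elasticsearch/certs/new-ca.crt
```

Each node needs a restart to pick this up, which is where the first rolling restart comes from; a later pass swaps the node certificates, and a final pass drops the old CA again.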

I would recommend that you check this documentation: Update TLS certificates | Elastic Docs

Changing the CA is a little more complicated, so I would also recommend allocating a maintenance window in which you may have some downtime.

And as mentioned, execute this process on a test cluster to validate all steps.

Any change carries risk, but it should be possible to do this without downtime. It will take several steps though:

  1. Add the new CA to the transport TLS trust store on every node (so both CAs are trusted).
  2. Replace the transport certificate on every node.
  3. Remove the old CA from the transport TLS trust store on every node.

You need an intermediate state in which nodes trust certificates issued by both CAs so that you can update each node’s certificates one by one.

If you don’t do it right, it will fail fairly noisily on at least one node, so keep a close eye on every node’s logs. Such a failure should only affect one node and, if so, you should be able to revert the last step and fix whatever went wrong.
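While watching the logs, a quick way to see which CA a given node is actually presenting mid-rotation is to inspect its transport certificate from outside. The hostname and port below are placeholders for one of your nodes:

```shell
# Print the issuer of the certificate a node presents on its transport port.
# Replace es-node1:9300 with the address of the node you are checking.
echo | openssl s_client -connect es-node1:9300 2>/dev/null | openssl x509 -noout -issuer
```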

Thank you @RainTown, @leandrojmp and @DavidTurner for looking into this and sharing the suggestions.

We will try the suggestion shared by @DavidTurner and share whether it works without any downtime.

We tested this scenario in our test environment with our current config. We observed that when Elasticsearch nodes are updated with new certificates issued by a different CA, they fail to join the existing cluster that is still running on the old certificates. This causes the updated nodes to operate independently instead of forming a unified cluster.

For reference, our current Elasticsearch configuration related to xpack security is:

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: <path_to_http_cert.pfx>
xpack.security.http.ssl.verification_mode: certificate

xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: <path_to_transport_cert.pfx>
xpack.security.transport.ssl.verification_mode: certificate

Without seeing the exact error it’s impossible to say for sure, but I expect this is because you didn’t reconfigure the nodes to trust both CAs first.
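Since your config uses PKCS#12 keystores, one way to reach that intermediate state is to add an explicit truststore containing both the old and the new CA certificates, roughly like this (a sketch; the truststore path is a placeholder, and I haven't tested it against your exact setup):

```yaml
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: <path_to_transport_cert.pfx>
xpack.security.transport.ssl.truststore.path: <path_to_transport_trust.p12>
xpack.security.transport.ssl.verification_mode: certificate
```

The truststore password goes into the Elasticsearch keystore (bin/elasticsearch-keystore add xpack.security.transport.ssl.truststore.secure_password). Roll this change across all nodes first; only once every node trusts both CAs should you start swapping the node certificates to the new CA.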
