Enabling auth after upgrading an Elasticsearch cluster from 7.17.9 to 8.17.6

Hello everyone,

I am looking for guidance on safely enabling security on an existing large Elasticsearch cluster.

Current Environment

  • Elasticsearch version: 8.17.6

  • Deployment type: VM-based cluster (not Kubernetes)

  • Master nodes: 3 dedicated master-eligible nodes

  • Total nodes: 227 nodes

    • Hot / Warm / Cold tiers
  • Total data size: ~220 TB

  • Cluster is currently running with:

    `xpack.security.enabled: false`
    `xpack.security.transport.ssl.enabled: false`
    `xpack.security.http.ssl.enabled: false`
    

Requirement

  • Enable security (authentication + transport TLS)

  • Zero downtime requirement (production cluster)

  • Avoid cluster instability or shard reallocation storms

What We Tried

We attempted a rolling restart approach:

  1. Updated elasticsearch.yml on all nodes to enable:

    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    xpack.security.http.ssl.enabled: true
    
  2. Restarted one master node first.

Result:

  • The restarted master node started successfully.
  • However, it could not form a quorum.
  • Observed repeated errors:

    `master not discovered or elected yet,`
    `an election requires at least 2 nodes...`
    `have only discovered non-quorum [...]`

This appears to be caused by a transport TLS mismatch during the rolling restart (the secured node cannot communicate with the unsecured masters).

To recover cluster stability, we reverted security settings and restarted.

Questions

  1. Is there a supported zero-downtime method to enable security (transport TLS + auth) on an existing large cluster?
  2. Is rolling restart officially supported for enabling transport TLS on an already-running unsecured cluster?
  3. For clusters of this size (227 nodes / 220 TB), is the recommended approach:
    • Full cluster restart?
    • Blue/Green migration?
    • Snapshot + restore?
    • Cross-cluster replication?
  4. Are there any best-practice guidelines specifically for large clusters when enabling security post-deployment?

Additional Notes

  • We understand that security settings are static and require a node restart.
  • We also understand that transport TLS must be uniform across all nodes.
  • Our main concern is maintaining quorum stability and avoiding large-scale shard reallocations during migration.

We would appreciate guidance from anyone who has enabled security on an existing large production cluster.

Thank you in advance.

These two points are correct, but together they imply that there’s no way to do what you want. Enabling transport-level TLS requires a full cluster restart.

What would be the best way to implement auth then? Can you please explain?

Given that a full restart is required, what is the safest operational way to execute it on a 227-node / 220 TB cluster? I tried this on a smaller cluster, but after making the config changes and restarting the Elasticsearch service on one of the nodes, that node would not join the cluster, and I had to revert the changes.

A full cluster restart means that you need to stop all nodes and then start them all again with the new configuration. Nodes that have security enabled cannot communicate with nodes that do not have security enabled, and vice versa.
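To make the quorum error from the original post concrete, here is a small sketch of the arithmetic. It assumes the voting configuration is exactly the 3 dedicated master-eligible nodes mentioned above; the function names are made up for illustration and are not part of any Elasticsearch API.

```python
def votes_needed(voting_nodes: int) -> int:
    """Strict majority of the voting configuration (assumed = master-eligible nodes)."""
    return voting_nodes // 2 + 1


def can_elect(reachable_voting_nodes: int, voting_nodes: int) -> bool:
    """True if a node that reaches this many voting nodes (itself included) can win an election."""
    return reachable_voting_nodes >= votes_needed(voting_nodes)


MASTERS = 3                           # dedicated master-eligible nodes in the cluster
secured_side = 1                      # the one master restarted with transport TLS enabled
unsecured_side = MASTERS - secured_side

print(votes_needed(MASTERS))               # 2
print(can_elect(secured_side, MASTERS))    # False: the secured node only reaches itself
print(can_elect(unsecured_side, MASTERS))  # True: the two unsecured masters still have a majority
```

So the restarted node sees 1 of the 2 votes it needs, which matches the "an election requires at least 2 nodes" message, while the rest of the cluster carries on without it.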

Yes, as Christian said: you have to stop all the nodes, update the settings, then start all the nodes again.
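For what it's worth, the stop/update/start sequence can be written down as an ordered plan, roughly following the usual full-cluster-restart pattern (disable replica allocation, flush, stop everything, change the config, start masters first, re-enable allocation). This is only a sketch under assumptions: the base URL is a placeholder, the `MANUAL` steps stand for whatever tooling you use to stop and start the service on the VMs, and `full_restart_plan` / `build_request` are hypothetical helper names. Nothing here sends a request.

```python
import json
import urllib.request


def full_restart_plan():
    """Ordered (method, path_or_description, body) steps for a full cluster restart."""
    return [
        # Keep primaries allocatable but stop replica reallocation while nodes are down.
        ("PUT", "/_cluster/settings",
         {"persistent": {"cluster.routing.allocation.enable": "primaries"}}),
        # Flush so shard recovery after the restart has less translog to replay.
        ("POST", "/_flush", None),
        ("MANUAL", "stop Elasticsearch on every node", None),
        ("MANUAL", "apply the security settings in elasticsearch.yml on every node", None),
        ("MANUAL", "start the master-eligible nodes first, then the data nodes", None),
        # After the restart, requests need credentials (and https if HTTP TLS is enabled).
        ("GET", "/_cluster/health?wait_for_status=yellow&timeout=60s", None),
        # Setting the value to null restores the default allocation behaviour.
        ("PUT", "/_cluster/settings",
         {"persistent": {"cluster.routing.allocation.enable": None}}),
    ]


def build_request(base_url, method, path, body):
    """Build (but do not send) a urllib request for an API step; None for manual steps."""
    if method == "MANUAL":
        return None
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)


if __name__ == "__main__":
    base_url = "http://localhost:9200"  # placeholder, adjust for your cluster
    for method, target, body in full_restart_plan():
        req = build_request(base_url, method, target, body)
        print(method, target if req is None else req.full_url)
```

The point of disabling replica allocation before the stop and only re-enabling it once the nodes have rejoined is to avoid the shard reallocation storm mentioned in the original post. I would rehearse the whole sequence on the smaller cluster first.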