Hello everyone,
I am looking for guidance on safely enabling security on an existing large Elasticsearch cluster.
Current Environment
- Elasticsearch version: 8.17.6
- Deployment type: VM-based cluster (not Kubernetes)
- Master nodes: 3 dedicated master-eligible nodes
- Total nodes: 227 nodes
- Hot / Warm / Cold tiers
- Total data size: ~220 TB
- Cluster is currently running with:
  - `xpack.security.enabled: false`
  - `xpack.security.transport.ssl.enabled: false`
  - `xpack.security.http.ssl.enabled: false`
Requirement
- Enable security (authentication + transport TLS)
- Zero downtime requirement (production cluster)
- Avoid cluster instability or shard reallocation storms
What We Tried
We attempted a rolling restart approach:
- Updated `elasticsearch.yml` on all nodes to enable:
  - `xpack.security.enabled: true`
  - `xpack.security.transport.ssl.enabled: true`
  - `xpack.security.http.ssl.enabled: true`
- Restarted one master node first
Result:
- Restarted master node started successfully.
- However, it could not form quorum.
- Observed repeated errors:
master not discovered or elected yet,
an election requires at least 2 nodes...
have only discovered non-quorum [...]
This appears to be caused by a transport TLS mismatch during the rolling restart: the secured node cannot communicate with the still-unsecured masters, so it never discovers enough master-eligible nodes to hold an election.
To recover cluster stability, we reverted security settings and restarted.
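To make our reading of the failure concrete, here is a minimal sketch of the quorum arithmetic as we understand it. It is only an illustration: the function names are hypothetical, it is not an Elasticsearch API, and it assumes all 3 master-eligible nodes are part of the voting configuration.

```python
def quorum_size(master_eligible: int) -> int:
    """Majority of the master-eligible (voting) nodes."""
    return master_eligible // 2 + 1


def can_elect(reachable_masters: int, master_eligible: int) -> bool:
    """True if a group of mutually reachable masters is large enough to elect."""
    return reachable_masters >= quorum_size(master_eligible)


# Our cluster: 3 dedicated master-eligible nodes, one restarted with TLS enabled.
TOTAL_MASTERS = 3
secured_side = 1    # the restarted node can only reach itself
unsecured_side = 2  # the two masters still running without TLS

print(quorum_size(TOTAL_MASTERS))               # 2
print(can_elect(secured_side, TOTAL_MASTERS))   # False
print(can_elect(unsecured_side, TOTAL_MASTERS)) # True
```

If this is right, the restarted node alone can never reach the 2 nodes the error message asks for, while the two unsecured masters keep their majority, which matches what we observed.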
Questions
- Is there a supported zero-downtime method to enable security (transport TLS + auth) on an existing large cluster?
- Is rolling restart officially supported for enabling transport TLS on an already-running unsecured cluster?
- For clusters of this size (227 nodes / 220 TB), is the recommended approach:
  - Full cluster restart?
  - Blue/Green migration?
  - Snapshot + restore?
  - Cross-cluster replication?
- Are there any best-practice guidelines specifically for large clusters when enabling security post-deployment?
Additional Notes
- We understand that security settings are static and require a node restart.
- We also understand that transport TLS must be uniform across all nodes.
- Our main concern is maintaining quorum stability and avoiding large-scale shard reallocations during migration.
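On the shard reallocation concern, the precaution we know from ordinary rolling restarts is to restrict allocation to primaries via the `cluster.routing.allocation.enable` cluster setting before stopping a node, then reset it afterwards. Below is a small sketch of the request bodies we would send to `PUT _cluster/settings`; the helper name is hypothetical, and whether this alone is sufficient for a security migration is part of what we are asking.

```python
import json


def allocation_settings_body(enable):
    """Build a body for PUT _cluster/settings.

    enable="primaries" limits allocation to primary shards;
    enable=None serializes to null, which resets the setting to its default.
    """
    return {"persistent": {"cluster.routing.allocation.enable": enable}}


# Hypothetical usage around a single node restart.
before_restart = allocation_settings_body("primaries")
after_restart = allocation_settings_body(None)

print(json.dumps(before_restart))
print(json.dumps(after_restart))
```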
We would appreciate guidance from anyone who has enabled security on an existing large production cluster.
Thank you in advance.