Cluster does not start after updating certificates

Our cluster's CA and node certificates expired. We renewed them by following the steps in the Discuss thread "Expired ca.crt/nodes certificates - how to renew such certificates?" (Elastic Stack / Elasticsearch).

We then restarted the Docker containers, and now the following logs appear on startup:


{"type": "server", "timestamp": "2024-10-14T13:50:15,373Z", "level": "ERROR", "component": "o.e.x.s.a.e.NativeUsersStore", "cluster.name": "es-itglobal-cluster", "node.name": "es02-ds1", "message": "security index is unavailable. short circuiting retrieval of user [user_name]" }
{"type": "server", "timestamp": "2024-10-14T13:50:15,762Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-itglobal-cluster", "node.name": "es02-ds1", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{es02-ds1}{...}{IP}{IP:9300}{...}{ml.machine_memory=35334332416, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {es03-ds1}{...}{IP}{IP}{...}{ml.machine_memory=33270734848, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, {es01-ds1}{....}{IP}{IP}{...}{ml.machine_memory=35334332416, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}]; discovery will continue using [10.32.0.82:9300, 10.32.0.84:9300, 10.32.0.120:9300] from hosts providers and [{es02-ds1}{...}{10.32.0.83}{10.32.0.83:9300}{...}{ml.machine_memory=35334332416, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

(IDs and some IPs redacted.)

I also tried restoring the old certificates and starting the nodes. That produces the following:

"stacktrace": ["io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed"
...
"Caused by: java.security.cert.CertificateExpiredException: NotAfter: Fri Aug 09 08:17:38 UTC 2024",

How can we start the cluster?
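As a rough sanity check on the restored certificates, the sketch below compares the NotAfter date from the exception above with the timestamp of the first log entry. The helper name `parse_not_after` is hypothetical, and the date format is assumed to match the one Java printed in this stack trace.

```python
from datetime import datetime, timezone

def parse_not_after(text):
    """Parse the date out of a Java 'NotAfter: ...' message.

    Assumes the UTC format seen in the stack trace above.
    """
    stamp = text.split("NotAfter:", 1)[1].strip().rstrip('",')
    parsed = datetime.strptime(stamp, "%a %b %d %H:%M:%S UTC %Y")
    return parsed.replace(tzinfo=timezone.utc)

# Expiry date reported by CertificateExpiredException.
not_after = parse_not_after("NotAfter: Fri Aug 09 08:17:38 UTC 2024")

# Timestamp of the first log entry after the restart (2024-10-14T13:50:15Z).
restart = datetime(2024, 10, 14, 13, 50, 15, tzinfo=timezone.utc)

expired = restart > not_after
days_past_expiry = (restart - not_after).days
print(f"expired={expired}, days past expiry={days_past_expiry}")
```

If this reports the certificate as expired, the old certificates will keep failing the TLS handshake, so restoring them cannot bring the cluster back.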

You're missing the relevant logs. These are a symptom, not a cause.

It looks like you have misconfigured your nodes so that they no longer trust one another and cannot form a cluster.

Could the cause be that the Elasticsearch containers were recreated while the playbook was running? I didn't change the cluster configuration.

We use the elasticsearch/elasticsearch:7.10.2 image.

The only ERROR-level entries I found look like this:

"level": "ERROR", "component": "o.e.x.s.a.e.NativeUsersStore", "cluster.name": "es-cluster", "node.name": "es03", "message": "security index is unavailable. short circuiting retrieval of user [user]"

Besides these, there are many other log entries. For example:

"level": "WARN", "component": "r.suppressed", "cluster.name": "es-cluster", "node.name": "es03", "message": "path: /_bulk, params: {}", 
"stacktrace": ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized, SERVICE_UNAVAILABLE/2/no master];

Could you tell me what to look for? Or should I attach all the logs at once? Could they contain sensitive information?

You're looking for something logged immediately after startup that tells you why your nodes are not connecting to each other.

Given you changed certificates, that's probably an SSL exception, but in theory it could be any number of networking issues or configuration errors.
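One way to narrow the search is to filter the JSON logs for entries that mention TLS failures. This is a minimal sketch, not an official tool: the function name `find_ssl_entries` and the keyword list are assumptions, and it relies only on the `message` and `stacktrace` fields visible in the logs quoted above.

```python
import json

# Assumed keywords: substrings that typically show up in Java TLS failures.
SSL_HINTS = (
    "SSLHandshakeException",
    "CertificateExpiredException",
    "CertPathValidatorException",
    "PKIX",
    "javax.net.ssl",
)

def find_ssl_entries(lines):
    """Return (timestamp, level, component, message) for TLS-related log lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a single JSON object per line
        if not isinstance(entry, dict):
            continue
        trace = entry.get("stacktrace", [])
        if isinstance(trace, str):
            trace = [trace]
        # Search both the message and the stack trace for the keywords.
        text = str(entry.get("message", "")) + " " + " ".join(map(str, trace))
        if any(hint in text for hint in SSL_HINTS):
            hits.append((
                entry.get("timestamp"),
                entry.get("level"),
                entry.get("component"),
                entry.get("message"),
            ))
    return hits
```

For example, save a node's output with `docker logs <container> > node.log 2>&1` (the file name is arbitrary), then call `find_ssl_entries(open("node.log"))` and read the earliest hits first. An empty result points towards the networking or configuration causes mentioned above rather than TLS.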