Elasticsearch cluster on ECK stuck in ApplyingChanges (503 Service Unavailable)

Hi all,

We (my team) have been running Elasticsearch 8.13.2 under ECK 2.12.1 for the last
couple of weeks.

We are using it to collect logs from several Kubernetes clusters, including the
one where Elasticsearch itself is running. The logs are shipped by Filebeat.
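
For context, the shipping side is nothing special. Here is a trimmed sketch of
what the Filebeat configuration looks like (the output host assumes ECK's
default <cluster-name>-es-http service; the namespace and credentials are
placeholders):

filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log

output.elasticsearch:
  # service name follows ECK's <cluster-name>-es-http convention; namespace is a placeholder
  hosts: ["https://logs-es-http.logging.svc:9200"]
  username: "elastic"
  password: "${ES_PASSWORD}"  # injected from a Kubernetes secret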

We have somehow ended up in a situation that we don't know how to get out of:
the cluster is stuck in the ApplyingChanges phase, and most API requests (e.g.,
PUT _cluster/settings, DELETE /<index>) return a 503 Service Unavailable
response. In the Elasticsearch pods' logs I see entries like this one:

{
  "@timestamp": "2024-07-16T13:44:24.740Z",
  "log.level": "WARN",
  "message": "path: /_internal/desired_nodes, params: {}, status: 503",
  "ecs.version": "1.2.0",
  "service.name": "ES_ECS",
  "event.dataset": "elasticsearch.server",
  "process.thread.name": "elasticsearch[logs-es-default-2][generic][T#1]",
  "log.logger": "rest.suppressed",
  "elasticsearch.cluster.uuid": "zm9eYp_nRXaa91CjpL_NJQ",
  "elasticsearch.node.id": "YlcqktgkTY2Kc0aykSt0gA",
  "elasticsearch.node.name": "logs-es-default-2",
  "elasticsearch.cluster.name": "logs",
  "error.type": "org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException",
  "error.message": "failed to process cluster event (delete-desired-nodes) within 30s",
  "error.stack_trace": "org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-desired-nodes) within 30s\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
}

and this one:

{
  "@timestamp": "2024-07-16T13:43:54.621Z",
  "log.level": "WARN",
  "message": "path: /_internal/desired_nodes, params: {}, status: 503",
  "ecs.version": "1.2.0",
  "service.name": "ES_ECS",
  "event.dataset": "elasticsearch.server",
  "process.thread.name": "elasticsearch[logs-es-default-1][transport_worker][T#1]",
  "log.logger": "rest.suppressed",
  "elasticsearch.cluster.uuid": "zm9eYp_nRXaa91CjpL_NJQ",
  "elasticsearch.node.id": "0W3fYyeRRiiomqfqMUoS9A",
  "elasticsearch.node.name": "logs-es-default-1",
  "elasticsearch.cluster.name": "logs",
  "error.type": "org.elasticsearch.transport.RemoteTransportException",
  "error.message": "[logs-es-default-2][10.244.163.182:9300][cluster:admin/desired_nodes/delete]",
  "error.stack_trace": "org.elasticsearch.transport.RemoteTransportException: [logs-es-default-2][10.244.163.182:9300][cluster:admin/desired_nodes/delete]\nCaused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-desired-nodes) within 30s\n\tat org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.lang.Thread.run(Thread.java:1583)\n"
}

No errors are being reported, just warnings.
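
For reference, these are the kinds of requests that come back with 503. The
settings body and the index name below are only illustrative:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

DELETE /old-filebeat-index

Both come back with 503 Service Unavailable.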

On the Elasticsearch client side I see errors saying that the disks have
exceeded the flood-stage watermark and the indices can no longer be written to.
This was the initial problem we faced; trying to solve it is what led to the
rest of the issues described here.
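
If it helps anyone reason about this, the per-node disk usage and the effective
watermark settings can be checked with the standard APIs (the watermarks show up
under cluster.routing.allocation.disk.watermark.*):

GET _cat/allocation?v
GET _cluster/settings?include_defaults=true&flat_settings=true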

So my question is - how do we get out of this situation? What can we do to bring
Elasticsearch back to a usable state?

The other issue is with the operator itself: I'm not sure what caused the
Elasticsearch resource to go back into the ApplyingChanges phase. The last
change we applied to the cluster went through fine, and the cluster was in the
Ready phase afterwards.

What might have triggered the phase change? I assume the operator detected some
drift between the desired state (the one specified in the Elasticsearch
resource manifest) and the actual state, but I wasn't able to find any trace of
the actual trigger in its logs.
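
In case it matters, this is roughly how the phase and the operator logs can be
checked, assuming the default ECK install (operator StatefulSet elastic-operator
in the elastic-system namespace; the namespace flag for the Elasticsearch
resource is omitted here):

# phase as reported by the operator (HEALTH, NODES, VERSION, PHASE columns)
kubectl get elasticsearch logs

# full status, as recorded by the operator
kubectl get elasticsearch logs -o yaml

# operator logs around the time the phase flipped
kubectl -n elastic-system logs statefulset/elastic-operator --since=24h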

Please let me know if you need any other details.

Thanks!