Elasticsearch cluster on ECK stuck in ApplyingChanges (503 Service Unavailable)

Hi all,

We (my team) have been running Elasticsearch 8.13.2 under ECK 2.12.1 for the last
couple of weeks.

We are using it to collect logs from several Kubernetes clusters, including the
one where Elasticsearch itself is running. The logs are shipped by Filebeat.
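
For context, the shipping side is nothing special. Here is a trimmed sketch of
what the Filebeat configuration looks like (the output host assumes ECK's
default <cluster-name>-es-http service; the namespace and credentials are
placeholders):

filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log

output.elasticsearch:
  # service name follows ECK's <cluster-name>-es-http convention; namespace is a placeholder
  hosts: ["https://logs-es-http.logging.svc:9200"]
  username: "elastic"
  password: "${ES_PASSWORD}"  # injected from a Kubernetes secret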

We have somehow ended up in a situation that we don't know how to get out of:
the cluster is stuck in the ApplyingChanges phase, and most API requests (e.g.,
PUT _cluster/settings, DELETE /<index>) return a 503 Service Unavailable
response. In the Elasticsearch pods' logs I see entries like this one:

{
  "@timestamp": "2024-07-16T13:44:24.740Z",
  "log.level": "WARN",
  "message": "path: /_internal/desired_nodes, params: {}, status: 503",
  "ecs.version": "1.2.0",
  "service.name": "ES_ECS",
  "event.dataset": "elasticsearch.server",
  "process.thread.name": "elasticsearch[logs-es-default-2][generic][T#1]",
  "log.logger": "rest.suppressed",
  "elasticsearch.cluster.uuid": "zm9eYp_nRXaa91CjpL_NJQ",
  "elasticsearch.node.id": "YlcqktgkTY2Kc0aykSt0gA",
  "elasticsearch.node.name": "logs-es-default-2",
  "elasticsearch.cluster.name": "logs",
  "error.type": "org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException",
  "error.message": "failed to process cluster event (delete-desired-nodes) within 30s",
  "error.stack_trace": "org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-desired-nodes) within 30s\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
}

and this one:

{
  "@timestamp": "2024-07-16T13:43:54.621Z",
  "log.level": "WARN",
  "message": "path: /_internal/desired_nodes, params: {}, status: 503",
  "ecs.version": "1.2.0",
  "service.name": "ES_ECS",
  "event.dataset": "elasticsearch.server",
  "process.thread.name": "elasticsearch[logs-es-default-1][transport_worker][T#1]",
  "log.logger": "rest.suppressed",
  "elasticsearch.cluster.uuid": "zm9eYp_nRXaa91CjpL_NJQ",
  "elasticsearch.node.id": "0W3fYyeRRiiomqfqMUoS9A",
  "elasticsearch.node.name": "logs-es-default-1",
  "elasticsearch.cluster.name": "logs",
  "error.type": "org.elasticsearch.transport.RemoteTransportException",
  "error.message": "[logs-es-default-2][10.244.163.182:9300][cluster:admin/desired_nodes/delete]",
  "error.stack_trace": "org.elasticsearch.transport.RemoteTransportException: [logs-es-default-2][10.244.163.182:9300][cluster:admin/desired_nodes/delete]\nCaused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-desired-nodes) within 30s\n\tat org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.lang.Thread.run(Thread.java:1583)\n"
}

No errors are being reported, just warnings.
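
For reference, these are the kinds of requests that come back with 503. The
settings body and the index name below are only illustrative:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

DELETE /old-filebeat-index

Both come back with 503 Service Unavailable.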

On the Elasticsearch client side I see errors saying that the disks have
exceeded the flood-stage watermark and the indices can no longer be written to.
This was the initial problem we faced; trying to solve it is what led to the
rest of the issues described here.
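
If it helps anyone reason about this, the per-node disk usage and the effective
watermark settings can be checked with the standard APIs (the watermarks show up
under cluster.routing.allocation.disk.watermark.*):

GET _cat/allocation?v
GET _cluster/settings?include_defaults=true&flat_settings=true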

So my question is - how do we get out of this situation? What can we do to bring
Elasticsearch back to a usable state?

The other issue is with the operator itself: I'm not sure what caused the
Elasticsearch resource to go back into the ApplyingChanges phase. The last
change we applied to the cluster went through fine, and the cluster was in the
Ready phase afterwards.

What might have triggered the phase change? I assume the operator detected some
drift between the desired state (the one specified in the Elasticsearch
resource manifest) and the actual state, but I wasn't able to find any trace of
the actual trigger in its logs.
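
In case it matters, this is roughly how the phase and the operator logs can be
checked, assuming the default ECK install (operator StatefulSet elastic-operator
in the elastic-system namespace; the namespace flag for the Elasticsearch
resource is omitted here):

# phase as reported by the operator (HEALTH, NODES, VERSION, PHASE columns)
kubectl get elasticsearch logs

# full status, as recorded by the operator
kubectl get elasticsearch logs -o yaml

# operator logs around the time the phase flipped
kubectl -n elastic-system logs statefulset/elastic-operator --since=24h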

Please let me know if you need any other details.

Thanks!