Elasticsearch (elasticsearch-8.9.2) stopping abruptly

We are using elasticsearch-8.9.2 for our application on an Azure VM with 64 GB RAM, of which 32 GB is set aside for Elasticsearch. Elasticsearch keeps failing with the exceptions below.

[2025-10-09T12:09:05,309][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [dev-app03] GC did not bring memory usage down, before [26240126904], after [26248508544], allocations [372], duration [299]
[2025-10-09T12:09:10,309][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [dev-app03] attempting to trigger G1GC due to high heap usage [26332394624]
[2025-10-09T12:09:10,605][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [dev-app03] GC did not bring memory usage down, before [26332394624], after [26347787800], allocations [366], duration [296]
[2025-10-09T12:09:15,605][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [dev-app03] attempting to trigger G1GC due to high heap usage [26448451096]
[2025-10-09T12:09:15,889][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [dev-app03] GC did bring memory usage down, before [26448451096], after [26448229384], allocations [359], duration [284]
[2025-10-09T12:09:18,303][INFO ][o.e.n.Node               ] [dev-app03] stopping ...
[2025-10-09T12:09:18,304][INFO ][o.e.c.f.AbstractFileWatchingService] [dev-app03] shutting down watcher thread
[2025-10-09T12:09:18,306][INFO ][o.e.c.f.AbstractFileWatchingService] [dev-app03] watcher service stopped
[2025-10-09T12:09:18,307][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [dev-app03] [controller/21796] [Main.cc@176] ML controller exiting
[2025-10-09T12:09:18,308][INFO ][o.e.x.w.WatcherService   ] [dev-app03] stopping watch service, reason [shutdown initiated]
[2025-10-09T12:09:18,308][INFO ][o.e.x.m.p.NativeController] [dev-app03] Native controller process has stopped - no new native processes can be started
[2025-10-09T12:09:18,309][INFO ][o.e.x.w.WatcherLifeCycleService] [dev-app03] watcher has stopped and shutdown

[2025-10-09T17:22:10,033][WARN ][r.suppressed ] [dev-app03] path: /.kibana_8.9.2/_doc/space%3Adefault, params: {index=.kibana_8.9.2, id=space:default}
org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.9.2][space:default]: routing [null]]
    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:201) ~[elasticsearch-8.9.2.jar:?]
    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:178) ~[elasticsearch-8.9.2.jar:?]
    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:97) ~[elasticsearch-8.9.2.jar:?]
    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:50) ~[elasticsearch-8.9.2.jar:?]
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:86) ~[elasticsearch-8.9.2.jar:?]

[2025-10-09T17:22:11,574][WARN ][r.suppressed ] [dev-app03] path: /.kibana_8.9.2/_doc/space%3Adefault, params: {index=.kibana_8.9.2, id=space:default}
org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.9.2][space:default]: routing [null]]
    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:201) ~[elasticsearch-8.9.2.jar:?]
    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.onFailure(TransportSingleShardAction.java:186) ~[elasticsearch-8.9.2.jar:?]
    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$1.handleException(TransportSingleShardAction.java:231) ~[elasticsearch-8.9.2.jar:?]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1419) ~[elasticsearch-8.9.2.jar:

Hi @Nataraj,

Welcome! That GC error suggests that you're hitting a circuit breaker. What is the health and JVM usage of your cluster?

GET _cluster/health
GET _cat/nodes?v=true&h=name,node*,heap*
GET _nodes/stats/breaker

What type of query/operation are you running when this is triggered? It might be down to the query size.
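If you can catch it while heap is climbing, the hot threads API is a quick (if verbose) way to see what the node is actually busy with:

GET _nodes/hot_threads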

The resources below may also help you investigate and remediate the problem:

Hope that helps!

Hi @carly.richmond

Thanks for the links, we will go through them. Below are the stats you asked for; let us know if any configuration needs to be updated. Elasticsearch stays up for a few minutes and then goes down automatically without any query execution.

_cluster/health
{"cluster_name": "elasticsearch","status": "yellow","timed_out": false,"number_of_nodes": 1,"number_of_data_nodes": 1,"active_primary_shards": 1027,"active_shards": 1027,"relocating_shards": 0,"initializing_shards": 0,"unassigned_shards": 26,"delayed_unassigned_shards": 0,"number_of_pending_tasks": 0,"number_of_in_flight_fetch": 0,"task_max_waiting_in_queue_millis": 0,"active_shards_percent_as_number": 97.5308641975309}
_cat/nodes?v=true&h=name,node*,heap*
name      id   node.role   heap.current heap.percent heap.max
dev-app03 mkWw cdfhilmrstw       11.4gb           57     20gb
_nodes/stats/breaker
{"_nodes": {"total": 1,"successful": 1,"failed": 0},"cluster_name": "elasticsearch","nodes": {"mkWwZwQaSCO7LDnqYLwdGg": {"timestamp": 1760024486223,"name": "dev-app03","transport_address": "127.0.0.1:9300","host": "127.0.0.1","ip": "127.0.0.1:9300","roles": ["data","data_cold","data_content","data_frozen","data_hot","data_warm","ingest","master","ml","remote_cluster_client","transform"],"attributes": {"xpack.installed": "true","ml.allocated_processors_double": "16.0","ml.max_jvm_size": "21474836480","ml.allocated_processors": "16","ml.machine_memory": "68718301184"},"breakers": {"eql_sequence": {"limit_size_in_bytes": 10737418240,"limit_size": "10gb","estimated_size_in_bytes": 0,"estimated_size": "0b","overhead": 1,"tripped": 0},"model_inference": {"limit_size_in_bytes": 10737418240,"limit_size": "10gb","estimated_size_in_bytes": 0,"estimated_size": "0b","overhead": 1,"tripped": 0},"inflight_requests": {"limit_size_in_bytes": 21474836480,"limit_size": "20gb","estimated_size_in_bytes": 0,"estimated_size": "0b","overhead": 2,"tripped": 0},"request": {"limit_size_in_bytes": 12884901888,"limit_size": "12gb","estimated_size_in_bytes": 0,"estimated_size": "0b","overhead": 1,"tripped": 0},"fielddata": {"limit_size_in_bytes": 8589934592,"limit_size": "8gb","estimated_size_in_bytes": 1633656,"estimated_size": "1.5mb","overhead": 1.03,"tripped": 0},"parent": {"limit_size_in_bytes": 21474836480,"limit_size": "20gb","estimated_size_in_bytes": 12401602944,"estimated_size": "11.5gb","overhead": 1,"tripped": 0}}}}} 

Thanks for sharing @Nataraj. It looks like your cluster is in a yellow state, which could be the problem:

"cluster_name": "elasticsearch","status": "yellow",
"unassigned_shards": 26

Can you explain how much data you are storing, how many indices you have, and for how long you are storing data (ILM settings)? I would also recommend looking at the Red or yellow cluster health status documentation to diagnose and fix the issue.
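The allocation explain API can also tell you why shards are unassigned; with no request body it reports on an arbitrary unassigned shard, so run it while the node is up:

GET _cluster/allocation/explain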

Hope that helps!

1027 active shards
26 unassigned
1027 - 26 = 1001 ~= 1000

Isn't the shard limit per node 1000?
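You can confirm what value is currently in effect (defaults included) with something like:

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node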

Have you increased this via something similar to:

PUT _cluster/settings
{
  "persistent" : {
    "cluster.max_shards_per_node" : 2000 
  }
}

If not, you can do that.

BUT, there are reasons why 1000 shards per node is a reasonable default.

What's less clear is why your heap usage is going up so quickly. But if the 1000 limit is an issue, and you can fix it and get down to zero unassigned shards, we can then probe that further. If so, please post the output from GET /_tasks after the system has been up for a few minutes, but before it crashes (obviously).

Some client might just be retrying a very expensive query/aggregation every X seconds/minutes, and whenever the cluster gets up and running, even yellow, it then has to try to process that and gets into a tailspin.
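If you suspect that, something like this should list any in-flight searches with their descriptions while the node is still responsive:

GET _tasks?actions=*search*&detailed=true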

EDIT: Your cluster is yellow, so all primary shards must be allocated. So the unassigned shards must be replicas, and with only one node there is nowhere to put them.
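On a single-node dev cluster you could simply drop the replicas per index if you don't need them (my-index below is just a placeholder for whichever indices have replicas configured):

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}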

Hi @carly.richmond: We do not have any ILM settings as we need the data always; we have 982 indices and a total of 280 GB of data.

Hi @RainTown: Thanks for jumping in. Here is the output of GET /_tasks:

{
  "nodes": {
    "mkWwZwQaSCO7LDnqYLwdGg": {
      "name": "dev-app03",
      "transport_address": "127.0.0.1:9300",
      "host": "127.0.0.1",
      "ip": "127.0.0.1:9300",
      "roles": [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "xpack.installed": "true",
        "ml.allocated_processors_double": "16.0",
        "ml.max_jvm_size": "21474836480",
        "ml.allocated_processors": "16",
        "ml.machine_memory": "68718301184"
      },
      "tasks": {
        "mkWwZwQaSCO7LDnqYLwdGg:1636": {
          "node": "mkWwZwQaSCO7LDnqYLwdGg",
          "id": 1636,
          "type": "persistent",
          "action": "health-node[c]",
          "start_time_in_millis": 1760010730065,
          "running_time_in_nanos": 25737504835300,
          "cancellable": true,
          "cancelled": false,
          "parent_task_id": "cluster:197",
          "headers": {

          }
        },
        "mkWwZwQaSCO7LDnqYLwdGg:1637": {
          "node": "mkWwZwQaSCO7LDnqYLwdGg",
          "id": 1637,
          "type": "persistent",
          "action": "geoip-downloader[c]",
          "start_time_in_millis": 1760010730065,
          "running_time_in_nanos": 25737503735100,
          "cancellable": true,
          "cancelled": false,
          "parent_task_id": "cluster:198",
          "headers": {

          }
        },
        "mkWwZwQaSCO7LDnqYLwdGg:1871874": {
          "node": "mkWwZwQaSCO7LDnqYLwdGg",
          "id": 1871874,
          "type": "transport",
          "action": "cluster:monitor/tasks/lists",
          "start_time_in_millis": 1760036467617,
          "running_time_in_nanos": 898399,
          "cancellable": true,
          "cancelled": false,
          "headers": {

          }
        },
        "mkWwZwQaSCO7LDnqYLwdGg:1871875": {
          "node": "mkWwZwQaSCO7LDnqYLwdGg",
          "id": 1871875,
          "type": "direct",
          "action": "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis": 1760036467617,
          "running_time_in_nanos": 775400,
          "cancellable": true,
          "cancelled": false,
          "parent_task_id": "mkWwZwQaSCO7LDnqYLwdGg:1871874",
          "headers": {

          }
        }
      }
    }
  }
}

There's nothing interesting in _tasks, except:

        "mkWwZwQaSCO7LDnqYLwdGg:1637": {
          "node": "mkWwZwQaSCO7LDnqYLwdGg",
          "id": 1637,
          "type": "persistent",
          "action": "geoip-downloader[c]",
          "start_time_in_millis": 1760010730065,
          "running_time_in_nanos": 25737503735100,
          "cancellable": true,
          "cancelled": false,
          "parent_task_id": "cluster:198",
          "headers": {

          }
        }

25737503735100 nanoseconds is roughly 25,737 seconds, i.e. 7+ hours.

Keep looking at the _tasks output as your JVM usage increases.
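For example, polling these two together gives a rough picture of heap growth versus what is actually running (filter_path just trims the stats output):

GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent
GET _tasks?detailed=true&group_by=parents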