Elasticsearch cluster goes down after 2-3 hours

Hi Everyone.

I am using Elasticsearch 8.13.4 and I have 3 machines, each with 30 GB of RAM and 1 TB of hard disk. I am running 2 nodes per machine using the portable elasticsearch-8.13.4.tar.gz download. Each node was given Xms and Xmx of 9 GB in the jvm.options file. The cluster worked fine for months without any issue, and I delete indices manually through Index Management in Kibana to keep 6 months of data. However, recently the cluster has been going down after about 2-3 hours, and I am not able to find the exact cause. The logs say:

[2025-12-10T12:16:02,954][INFO ][o.e.m.j.JvmGcMonitorService] [Kale] [gc][361] overhead, spent [437ms] collecting in the last [1s]
[2025-12-10T12:16:40,008][INFO ][o.e.m.j.JvmGcMonitorService] [Kale] [gc][398] overhead, spent [278ms] collecting in the last [1s]
[2025-12-10T12:16:45,009][INFO ][o.e.m.j.JvmGcMonitorService] [Kale] [gc][403] overhead, spent [286ms] collecting in the last [1s]
[2025-12-10T15:04:23,451][INFO ][o.e.x.m.p.NativeController] [Kale] Native controller process has stopped - no new native processes can be started
[2025-12-10T15:04:23,537][INFO ][o.e.n.Node ] [Kale] stopping ...
[2025-12-10T15:04:23,540][INFO ][o.e.x.w.WatcherService ] [Kale] stopping watch service, reason [shutdown initiated]
[2025-12-10T15:04:23,541][INFO ][o.e.x.w.WatcherLifeCycleService] [Kale] watcher has stopped and shutdown
[2025-12-10T15:04:23,595][INFO ][o.e.c.c.Coordinator ] [Kale] master node [{Europa}{asrawfdsvdzvdc}{asrawfdsvdzvdc}{Europa}{x.x.x.x.}{x.x.x.x.:9300}{dm}{8.13.4}{7000099-8503000}] disconnected, restarting discovery
[2025-12-10T15:04:23,608][INFO ][o.e.h.AbstractHttpServerTransport] [Kale] channel [Netty4HttpChannel{localAddress=/x.x.x.x.:9201, remoteAddress=/x.x.x.y:64422}] already closed
[2025-12-10T15:04:30,351][INFO ][o.e.n.Node ] [Kale] stopped
[2025-12-10T15:04:30,351][INFO ][o.e.n.Node ] [Kale] closing ...
[2025-12-10T15:04:30,444][INFO ][o.e.n.Node ] [Kale] closed

[2025-12-10T14:51:19,565][INFO ][o.e.c.r.a.AllocationService] [Europa] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[perfmon_adc01-2025.12.05][0]]])." previous.health="YELLOW" reason="shards started [[perfmon_adc01-2025.12.05][0]]"
[2025-12-10T14:51:23,259][INFO ][o.e.c.m.MetadataCreateIndexService] [Europa] [perfmon_web02-2025.11.26] creating index, cause [auto(bulk api)], templates , shards [1]/[1]
[2025-12-10T15:04:23,310][WARN ][o.e.c.s.MasterService ] [Europa] took [13m/780053ms] to compute cluster state update for [auto create [perfmon_web02-2025.11.26][org.elasticsearch.action.admin.indices.create.AutoCreateAction$TransportAction$CreateIndexTask@67c24fce]], which exceeds the warn threshold of [10s]
[2025-12-10T15:04:23,453][INFO ][o.e.c.m.MetadataCreateIndexService] [Europa] [perfmon_adc02-2025.12.05] creating index, cause [auto(bulk api)], templates , shards [1]/[1]
[2025-12-10T15:04:23,465][INFO ][o.e.x.m.p.NativeController] [Europa] Native controller process has stopped - no new native processes can be started
[2025-12-10T15:04:23,538][INFO ][o.e.n.Node ] [Europa] stopping ...
[2025-12-10T15:04:23,544][INFO ][o.e.c.f.AbstractFileWatchingService] [Europa] shutting down watcher thread
[2025-12-10T15:04:23,560][INFO ][o.e.c.f.AbstractFileWatchingService] [Europa] watcher service stopped
[2025-12-10T15:04:23,561][INFO ][o.e.x.w.WatcherService ] [Europa] stopping watch service, reason [shutdown initiated]
[2025-12-10T15:04:23,563][INFO ][o.e.x.w.WatcherLifeCycleService] [Europa] watcher has stopped and shutdown
[2025-12-10T15:04:23,613][INFO ][o.e.t.ClusterConnectionManager] [Europa] transport connection to [{Kale}{asrawfdsvdzvdc}{asrawfdsvdzvdc-cGg}{Kale}{x.x.x.x.102}{x.x.x.x.102:9301}{d}{8.13.4}{7000099-8503000}] closed by remote
[2025-12-10T15:04:23,715][WARN ][o.e.c.NodeConnectionsService] [Europa] failed to connect to {Kale}{asrawfdsvdzvdc}{asrawfdsvdzvdc-cGg}{Kale}{x.x.x.x.102}{x.x.x.x.102:9301}{d}{8.13.4}{7000099-8503000}{xpack.installed=true, ml.config_version=12.0.0, transform.config_version=10.0.0} (tried [1] times)

What is the resolution for this issue? Please help me out here.

Also, when I start the cluster again, what are the things that I need to look at?

How many indices do you have? How many nodes do you have?

What does a GET on the following return:

_cat/nodes?v&h=name,ip,role,version,master,u,cpu,rc,rm,rp,hc,hm,hp,load_1m,load_5m,load_15m&bytes=b

and

_cat/indices?index=*,.*&h=index,health,dc,ss,cd&s=cd&bytes=b

Also,

[2025-12-10T15:04:23,310][WARN ][o.e.c.s.MasterService    ] [Europa] took [13m/780053ms] to compute cluster state update for [auto create [perfmon_web02-2025.11.26][org.elasticsearch.action.admin.indices.create.AutoCreateAction$TransportAction$CreateIndexTask@67c24fce]], which exceeds the warn threshold of [10s]

Why was it trying to auto-create an index with 2025.11.26 in the index name on 2025-12-10? In any case, the fact that it took 13 minutes (sic) to compute a cluster state update is an indicator of a significant problem.
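
If it happens again, it may also be worth checking whether cluster state updates are queueing up behind the master while this is going on. These are standard read-only APIs, just a suggestion:

_cluster/pending_tasks
_cat/pending_tasks?v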

Just to be clear: are these 2 "nodes" just different Elasticsearch (Java) processes running directly on the same host, or is there some other layer involved (virtual machines, containers, whatever)?

Please check all nodes' logs for ERROR and WARN entries before the crashes. Also check the system logs for any limits you may have reached, OOM kills, or similar.
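
For example, something along these lines should surface the relevant entries (the log path below is just a placeholder, adjust it to wherever you extracted the tar.gz):

grep -E "WARN|ERROR" /path/to/elasticsearch-8.13.4/logs/*.log
dmesg -T | grep -iE "out of memory|oom"
journalctl -k | grep -i "killed process"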

To get a better view of the state of the cluster it would be great if you could post the full output of the cluster stats API.
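
That is, a GET on:

_cluster/stats?human&pretty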

Each node should have no more than 50% of the memory available to it assigned to heap. If you have 2 nodes and a total of 30GB RAM (assuming no other processes running on the host consume significant resources), your heap size should not be larger than 7.5GB per node.
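
As a sketch, assuming the tar.gz layout you described: that would be a file like config/jvm.options.d/heap.options on each node (any .options file in that directory is picked up), keeping Xms and Xmx equal, e.g.

-Xms7g
-Xmx7g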

Elasticsearch also assumes it has full access to all available RAM, so if you are using VMs it would be useful to verify that these are not overprovisioned in a way that makes memory usage result in swapping behind the scenes (which could lead to long GC times).

It would also help to know exactly what type of storage you are using as this is a common cause of performance problems. Is it local SSD, local HDD or some type of networked storage?

Below is the response for the _cat/nodes API:

name ip role version master u cpu rc rm rp hc hm hp load_1m load_5m load_15m
Callisto x.x.x.y d 8.13.4 - 10.2m 41 32977051648 33499344896 98 3987406536 9663676416 41 2.95 4.61 2.90
Elara x.x.x.y dm 8.13.4 - 11.4m 41 32954441728 33499344896 98 4746561920 9663676416 49 2.95 4.61 2.90
Kore x.x.x.z dm 8.13.4 - 11.3m 16 32517181440 33543311360 97 7084179456 9663676416 73 1.24 3.56 2.51
Europa x.x.x.x dm 8.13.4 * 11.5m 15 33251794944 33499344896 99 3271245552 9663676416 33 0.94 2.44 1.75
Kale x.x.x.x d 8.13.4 - 10.3m 15 33253756928 33499344896 99 6425673728 9663676416 66 0.94 2.44 1.75
Amalthea x.x.x.z d 8.13.4 - 10.1m 16 32519208960 33543311360 97 4173332480 9663676416 43 1.24 3.56 2.51

There are 1501 indices in total.

I am differentiating indices based on date, so each index gets created with the date as a suffix.

The machines are VMs on a hypervisor, and the 2 Elasticsearch nodes run as separate processes, not in Docker, and no other layers are involved. There are no ERROR or WARN logs before the crash apart from the ones mentioned above.

I have also tried setting the Xms and Xmx values to 7GB and I still see the same issue.

The hard disk type is HDD.

And how many shards and how many replicas per index?

The specific command sorts the indices in creation-date order. Look at the output and see if it makes sense; you know your data.

2 primary shards per index with 1 replica each would be 6000+ shards. IIRC the default limit is 1000 shards per node, and you have 6 nodes. Suggestive, if a wild guess.
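
If you want to see where you stand against that limit, both the setting and the current shard counts can be read back directly (standard APIs; the filter_path just trims the output):

_cluster/settings?include_defaults=true&filter_path=**.max_shards_per_node
_cluster/health?filter_path=active_shards,active_primary_shards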

This tends to create loads of indices of wildly different sizes, which is not good, as your cluster state gets big. 1500 indices is a lot.

Yeah, not great.

The point about memory is that running both "nodes" on the same host means they basically compete with each other for the same memory. Also, please make sure you have no swap space.
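
As a quick sketch of how you could verify that on each VM (standard Linux tools; bootstrap.memory_lock is the usual alternative if you cannot remove swap entirely, and it also needs the memlock ulimit raised):

swapon --show     # no output means no active swap devices
free -h           # the Swap line should read 0B
sudo swapoff -a   # disables swap until the next reboot

# alternatively, in each node's elasticsearch.yml:
bootstrap.memory_lock: true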

Did you have 1500 indices a few months ago?


... and they are also competing for the same (limited) disk bandwidth / IOPS.

tbh you may be better off with a 3-node cluster.


There are 2 shards per index: 1 primary and 1 replica. So the total number of shards is double the number of indices.

@RainTown
We have been using this setup for a year now. I have never faced this issue before.

No, there were even more, around 3K. I had to trim down the indices due to this issue.

OK. Please share the cluster stats output that was requested above.

This is indeed not great and can be the source of all kinds of performance problems, especially if you have a large number of small shards.
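
One way to see how many of those shards are tiny is to list them sorted by store size (a standard _cat endpoint):

_cat/shards?v&h=index,shard,prirep,store,docs&s=store&bytes=b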

It would be useful if you could use e.g. iostat to check await and disk utilisation to see if this is indeed a problem.
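
For example (iostat is in the sysstat package; the 5-second interval is just a suggestion), run the line below while the cluster is indexing and watch the await (or r_await/w_await) and %util columns for the data disks:

iostat -x 5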

@RainTown Is there any way I can share the text file? I am not able to find an upload file option here.

I think new users may have some restrictions? The upload button is in the editor.

If you want, you can send it to me in a DM and I'll paste it for you. It's likely a large text file, so it may be better to use pastebin or similar and just paste the link.

{
  "_nodes" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "cluster_name" : "PerfElastic",
  "cluster_uuid" : "demoUUID",
  "timestamp" : 1765452760288,
  "status" : "green",
  "indices" : {
    "count" : 1512,
    "shards" : {
      "total" : 3024,
      "primaries" : 1512,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 357120918,
      "deleted" : 474
    },
    "store" : {
      "size_in_bytes" : 425100974480,
      "total_data_set_size_in_bytes" : 425100974480,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 17656,
      "evictions" : 0,
      "global_ordinals" : {
        "build_time_in_millis" : 549
      }
    },
    "query_cache" : {
      "memory_size_in_bytes" : 0,
      "total_count" : 1773,
      "hit_count" : 0,
      "miss_count" : 1773,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 54589,
      "memory_in_bytes" : 0,
      "terms_memory_in_bytes" : 0,
      "stored_fields_memory_in_bytes" : 0,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 0,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 0,
      "index_writer_memory_in_bytes" : 151196684,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 726976,
      "max_unsafe_auto_id_timestamp" : 1765449072842,
      "file_sizes" : { }
    },
    "mappings" : {
      "total_field_count" : 10308141,
      "total_deduplicated_field_count" : 160120,
      "total_deduplicated_mapping_size_in_bytes" : 762897,
      "field_types" : [
        {
          "name" : "alias",
          "count" : 385642,
          "index_count" : 1075,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 82865,
          "index_count" : 1090,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 1077,
          "index_count" : 1077,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 3225,
          "index_count" : 1075,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 112401,
          "index_count" : 1498,
          "script_count" : 0
        },
        {
          "name" : "date_range",
          "count" : 10,
          "index_count" : 10,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 364105,
          "index_count" : 1077,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 14077,
          "index_count" : 1087,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 334137,
          "index_count" : 1377,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 9690,
          "index_count" : 1077,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 7518,
          "index_count" : 1074,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 24754,
          "index_count" : 1078,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 1799740,
          "index_count" : 1498,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 3150290,
          "index_count" : 1455,
          "script_count" : 0
        },
        {
          "name" : "match_only_text",
          "count" : 67942,
          "index_count" : 1084,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 21543,
          "index_count" : 1077,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 3726122,
          "index_count" : 1484,
          "script_count" : 0
        },
        {
          "name" : "rank_features",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 166492,
          "index_count" : 1080,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 17100,
          "index_count" : 1489,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 10,
          "index_count" : 10,
          "script_count" : 0
        },
        {
          "name" : "wildcard",
          "count" : 19398,
          "index_count" : 1077,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ],
      "synonyms" : { }
    },
    "versions" : [
      {
        "version" : "8503000",
        "index_count" : 1512,
        "primary_shard_count" : 1512,
        "total_primary_bytes" : 212379558068
      }
    ],
    "search" : {
      "total" : 1287,
      "queries" : {
        "match_phrase" : 4,
        "bool" : 1281,
        "terms" : 19,
        "match" : 13,
        "match_phrase_prefix" : 1,
        "exists" : 174,
        "range" : 1038,
        "term" : 1037,
        "query_string" : 197,
        "simple_query_string" : 134
      },
      "rescorers" : { },
      "sections" : {
        "highlight" : 77,
        "stored_fields" : 154,
        "runtime_mappings" : 790,
        "query" : 1283,
        "script_fields" : 154,
        "_source" : 90,
        "pit" : 19,
        "fields" : 157,
        "aggs" : 1021
      }
    },
    "dense_vector" : {
      "value_count" : 0
    }
  },
  "nodes" : {
    "count" : {
      "total" : 6,
      "coordinating_only" : 0,
      "data" : 6,
      "data_cold" : 0,
      "data_content" : 0,
      "data_frozen" : 0,
      "data_hot" : 0,
      "data_warm" : 0,
      "index" : 0,
      "ingest" : 0,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "search" : 0,
      "transform" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "8.13.4"
    ],
    "os" : {
      "available_processors" : 24,
      "allocated_processors" : 24,
      "names" : [
        {
          "name" : "Linux",
          "count" : 6
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 8",
          "count" : 4
        },
        {
          "pretty_name" : "CentOS Linux 8 (Core)",
          "count" : 2
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 6
        }
      ],
      "mem" : {
        "total_in_bytes" : 201084002304,
        "adjusted_total_in_bytes" : 201084002304,
        "free_in_bytes" : 3605323776,
        "used_in_bytes" : 197478678528,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 25
      },
      "open_file_descriptors" : {
        "min" : 2144,
        "max" : 2222,
        "avg" : 2197
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 4034567,
      "versions" : [
        {
          "version" : "21.0.2",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "21.0.2+13-58",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 6
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 27673302608,
        "heap_max_in_bytes" : 45097156608
      },
      "threads" : 473
    },
    "fs" : {
      "total_in_bytes" : 3030045253632,
      "free_in_bytes" : 1754957750272,
      "available_in_bytes" : 1754957750272
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 6
      },
      "http_types" : {
        "security4" : 6
      }
    },
    "discovery_types" : {
      "multi-node" : 6
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "tar",
        "count" : 6
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 30,
      "processor_stats" : {
        "attachment" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "date_index_name" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "dot_expander" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "foreach" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "inference" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "json" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "pipeline" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set_security_user" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "trim" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "uri_parts" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "user_agent" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    },
    "indexing_pressure" : {
      "memory" : {
        "current" : {
          "combined_coordinating_and_primary_in_bytes" : 0,
          "coordinating_in_bytes" : 0,
          "primary_in_bytes" : 0,
          "replica_in_bytes" : 0,
          "all_in_bytes" : 0
        },
        "total" : {
          "combined_coordinating_and_primary_in_bytes" : 0,
          "coordinating_in_bytes" : 0,
          "primary_in_bytes" : 0,
          "replica_in_bytes" : 0,
          "all_in_bytes" : 0,
          "coordinating_rejections" : 0,
          "primary_rejections" : 0,
          "replica_rejections" : 0
        },
        "limit_in_bytes" : 0
      }
    }
  },
  "snapshots" : {
    "current_counts" : {
      "snapshots" : 0,
      "shard_snapshots" : 0,
      "snapshot_deletions" : 0,
      "concurrent_operations" : 0,
      "cleanups" : 0
    },
    "repositories" : { }
  }
}

Just FYI...

I am currently running the cluster with a heap of 7GB per node.