I am using Elasticsearch 8.13.4 and I have 3 machines, each with 30 GB of RAM and 1 TB of hard disk. I am running 2 nodes per machine from the portable elasticsearch-8.13.4.tar.gz download. Each node was given Xms and Xmx of 9 GB in the jvm.options file. The cluster was working fine for months without any issue, and I delete indices manually through Index Management in Kibana to keep 6 months of data. However, recently the cluster has been going down after about 2-3 hours, and I am not able to find the exact cause. The logs say:
[2025-12-10T12:16:02,954][INFO ][o.e.m.j.JvmGcMonitorService] [Kale] [gc][361] overhead, spent [437ms] collecting in the last [1s]
[2025-12-10T12:16:40,008][INFO ][o.e.m.j.JvmGcMonitorService] [Kale] [gc][398] overhead, spent [278ms] collecting in the last [1s]
[2025-12-10T12:16:45,009][INFO ][o.e.m.j.JvmGcMonitorService] [Kale] [gc][403] overhead, spent [286ms] collecting in the last [1s]
[2025-12-10T15:04:23,451][INFO ][o.e.x.m.p.NativeController] [Kale] Native controller process has stopped - no new native processes can be started
[2025-12-10T15:04:23,537][INFO ][o.e.n.Node ] [Kale] stopping ...
[2025-12-10T15:04:23,540][INFO ][o.e.x.w.WatcherService ] [Kale] stopping watch service, reason [shutdown initiated]
[2025-12-10T15:04:23,541][INFO ][o.e.x.w.WatcherLifeCycleService] [Kale] watcher has stopped and shutdown
[2025-12-10T15:04:23,595][INFO ][o.e.c.c.Coordinator ] [Kale] master node [{Europa}{asrawfdsvdzvdc}{asrawfdsvdzvdc}{Europa}{x.x.x.x.}{x.x.x.x.:9300}{dm}{8.13.4}{7000099-8503000}] disconnected, restarting discovery
[2025-12-10T15:04:23,608][INFO ][o.e.h.AbstractHttpServerTransport] [Kale] channel [Netty4HttpChannel{localAddress=/x.x.x.x.:9201, remoteAddress=/x.x.x.y:64422}] already closed
[2025-12-10T15:04:30,351][INFO ][o.e.n.Node ] [Kale] stopped
[2025-12-10T15:04:30,351][INFO ][o.e.n.Node ] [Kale] closing ...
[2025-12-10T15:04:30,444][INFO ][o.e.n.Node ] [Kale] closed
[2025-12-10T14:51:19,565][INFO ][o.e.c.r.a.AllocationService] [Europa] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[perfmon_adc01-2025.12.05][0]]])." previous.health="YELLOW" reason="shards started [[perfmon_adc01-2025.12.05][0]]"
[2025-12-10T14:51:23,259][INFO ][o.e.c.m.MetadataCreateIndexService] [Europa] [perfmon_web02-2025.11.26] creating index, cause [auto(bulk api)], templates , shards [1]/[1]
[2025-12-10T15:04:23,310][WARN ][o.e.c.s.MasterService ] [Europa] took [13m/780053ms] to compute cluster state update for [auto create [perfmon_web02-2025.11.26][org.elasticsearch.action.admin.indices.create.AutoCreateAction$TransportAction$CreateIndexTask@67c24fce]], which exceeds the warn threshold of [10s]
[2025-12-10T15:04:23,453][INFO ][o.e.c.m.MetadataCreateIndexService] [Europa] [perfmon_adc02-2025.12.05] creating index, cause [auto(bulk api)], templates , shards [1]/[1]
[2025-12-10T15:04:23,465][INFO ][o.e.x.m.p.NativeController] [Europa] Native controller process has stopped - no new native processes can be started
[2025-12-10T15:04:23,538][INFO ][o.e.n.Node ] [Europa] stopping ...
[2025-12-10T15:04:23,544][INFO ][o.e.c.f.AbstractFileWatchingService] [Europa] shutting down watcher thread
[2025-12-10T15:04:23,560][INFO ][o.e.c.f.AbstractFileWatchingService] [Europa] watcher service stopped
[2025-12-10T15:04:23,561][INFO ][o.e.x.w.WatcherService ] [Europa] stopping watch service, reason [shutdown initiated]
[2025-12-10T15:04:23,563][INFO ][o.e.x.w.WatcherLifeCycleService] [Europa] watcher has stopped and shutdown
[2025-12-10T15:04:23,613][INFO ][o.e.t.ClusterConnectionManager] [Europa] transport connection to [{Kale}{asrawfdsvdzvdc}{asrawfdsvdzvdc-cGg}{Kale}{x.x.x.x.102}{x.x.x.x.102:9301}{d}{8.13.4}{7000099-8503000}] closed by remote
[2025-12-10T15:04:23,715][WARN ][o.e.c.NodeConnectionsService] [Europa] failed to connect to {Kale}{asrawfdsvdzvdc}{asrawfdsvdzvdc-cGg}{Kale}{x.x.x.x.102}{x.x.x.x.102:9301}{d}{8.13.4}{7000099-8503000}{xpack.installed=true, ml.config_version=12.0.0, transform.config_version=10.0.0} (tried [1] times)
What is the resolution for this issue? Please help me here.
[2025-12-10T15:04:23,310][WARN ][o.e.c.s.MasterService ] [Europa] took [13m/780053ms] to compute cluster state update for [auto create [perfmon_web02-2025.11.26][org.elasticsearch.action.admin.indices.create.AutoCreateAction$TransportAction$CreateIndexTask@67c24fce]], which exceeds the warn threshold of [10s]
Why was it trying to auto-create an index with 2025.11.26 in the index name on 2025-12-10? In any case, the fact that it took 13 minutes (sic) to compute a cluster state update is an indicator of a significant problem.
Just to be clear: are these 2 "nodes" simply separate Elasticsearch (Java) processes running directly on the same host, or is there some other layer involved (virtual machines, containers, whatever)?
Please check all nodes' logs for ERROR and WARN entries before the crashes. Also check the system logs for any limits you may have reached, OOM kills, or similar.
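For example, something along these lines; the log path here is an assumption based on a typical tar.gz install, so adjust it to your actual node directories:

```sh
# Scan each node's Elasticsearch logs for warnings and errors around the crash window
grep -E "WARN|ERROR" /path/to/elasticsearch-8.13.4/logs/*.log

# Check kernel/system logs for OOM-killer activity or other limits being hit
dmesg -T | grep -iE "out of memory|oom|killed process"
journalctl --since "today" | grep -iE "oom|killed process"
```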
To get a better view of the state of the cluster, it would be great if you could post the full output of the cluster stats API.
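For example (assuming the default HTTP port; add credentials or TLS flags if security is enabled on your cluster):

```sh
# Full cluster stats, human-readable and pretty-printed
curl -s "http://localhost:9200/_cluster/stats?human&pretty"
```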
Each node should have no more than 50% of the memory available to it assigned to heap. If you have 2 nodes and a total of 30GB RAM (assuming no other processes running on the host consume significant resources), your heap size should not be larger than 7.5GB per node.
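As a minimal sketch, assuming the standard tar.gz layout, that would mean dropping the per-node heap from 9g to something like 7g via a jvm.options.d override (the file name and path below are illustrative) and restarting each node:

```sh
# Create a heap override; files under config/jvm.options.d take precedence over jvm.options
cat > /path/to/elasticsearch-8.13.4/config/jvm.options.d/heap.options <<'EOF'
-Xms7g
-Xmx7g
EOF
```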
Elasticsearch also assumes it has full access to all available RAM, so if you are using VMs it would be useful to verify that they are not overprovisioned such that memory usage results in swapping behind the scenes (which could lead to long GC times).
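A quick way to see whether a guest is actually swapping (run on each VM):

```sh
# Non-zero swap "used", or steady si/so columns in vmstat, indicate memory pressure
free -h
vmstat 5 3
```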
It would also help to know exactly what type of storage you are using, as this is a common cause of performance problems. Is it local SSD, local HDD, or some type of networked storage?
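If you are unsure what the VMs were given, this can at least hint at it (ROTA=1 usually means a rotational/spinning disk, 0 usually means SSD/flash; networked storage may not be distinguishable this way):

```sh
lsblk -d -o NAME,ROTA,SIZE,TYPE
```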
The machines are VMs on a hypervisor, and the 2 Elasticsearch nodes run as separate processes, not in Docker; there are no other layers involved. And there are no ERROR or WARN logs before the crash apart from the ones mentioned above.
The specific command sorted the indices in creation-date order. Look at it and see if it makes sense; you know your data.
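The command itself is not quoted in this excerpt, but it was presumably something along the lines of the cat indices API sorted by creation date:

```sh
curl -s "http://localhost:9200/_cat/indices?v&h=index,creation.date.string,pri,rep,store.size&s=creation.date"
```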
2 shards/index, 1 replica, would be 6000+ shards. IIRC the default limit is 1000 shards/node, and you have 6 nodes. Suggestive, if a wild guess.
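A couple of quick checks along those lines (adjust host/port and authentication for your cluster):

```sh
# Total number of shards in the cluster, and how they are spread across nodes
curl -s "http://localhost:9200/_cat/shards" | wc -l
curl -s "http://localhost:9200/_cat/allocation?v"

# Effective cluster.max_shards_per_node (defaults to 1000 per non-frozen data node)
curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep max_shards_per_node
```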
This tends to create loads of indices of wildly different sizes, which is not good, as your cluster state gets big. 1500 indices is a lot.
Yeah, not great.
The point about memory is that running both "nodes" on the same host means they basically compete with each other for the same memory. Also, please make sure you have no swap space.
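A sketch of the usual way to do that on Linux (run as root; alternatively, keep swap but set bootstrap.memory_lock: true in elasticsearch.yml):

```sh
# Disable swap immediately; also remove or comment out swap entries in /etc/fstab to make it permanent
swapoff -a

# Or, at minimum, keep swapping to a bare minimum
sysctl -w vm.swappiness=1
```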
If you want, you can send it to me in a DM and I'll paste it for you. It's likely a large text file, so it may be better to use pastebin or similar and just paste the link.