Elasticsearch Nodes Randomly Crashing Due to JVM Native Memory Allocation Failure on Windows Servers

Shubham_Khodpe · May 20, 2026, 7:51am

Hi Team,

We are facing frequent Elasticsearch node crashes across all nodes in our cluster and need guidance on identifying the root cause and recommended tuning.

Environment Details

Elasticsearch Version: 9.1.3
OS: Windows Server
Cluster Type: Multi-node cluster
Each Server Configuration:
- CPU: 16 Cores
- RAM: 32 GB
JVM Heap Configuration:
- Xms = 16g
- Xmx = 16g

Issue

All Elasticsearch nodes randomly stop multiple times during operations such as indexing, shard relocation, and snapshot/archive activities.

We are seeing the following errors in logs:

There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (malloc) failed to allocate bytes.
Chunk::new

Elasticsearch exited unexpectedly, with exit code 1

We also see related errors such as:

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

Additional Observations

Memory utilization on all Elasticsearch nodes remains consistently high, around 95% to 98%.
Cluster frequently performs shard relocation/recovery.
We are also using snapshots/archive storage.
Heavy ingestion is happening through Logstash.
Grafana dashboards are connected to Elasticsearch.
The issue is occurring on all nodes.

Questions

Is 16 GB heap too high for a 32 GB Windows server in Elasticsearch?
Could this be caused by native/off-heap memory exhaustion?
Are there any recommended JVM settings for Windows environments?
Could shard count or shard recovery activity be causing these crashes?
What is the recommended heap size and tuning for our server configuration?
Does consistently high memory utilization (95–98%) indicate improper heap sizing or insufficient OS/native memory availability?

Any suggestions or best practices would be very helpful.

Thanks.

RainTown · May 20, 2026, 8:28am

Is this the same cluster discussed in your previous thread, "Elasticsearch data node instability and indexing failures when using NAS (shared storage) with separate folders per node"?

In that thread, you had:

We are observing the following issues:
- Bulk indexing failures (HTTP 500 errors)
- Errors such as:
  - AlreadyClosedException: this ReferenceManager is closed
  - RemoteTransportException
  - UnavailableShardsException
- Shards becoming unavailable intermittently
- Cluster not recovering properly at times
- Locking issue
- Grafana dashboards taking a long time to load data

If it is the same cluster, how did you address the storage issues?

Shubham_Khodpe · May 20, 2026, 8:58am

It is the same cluster, but the data is now stored on each server's local disk and not on a NAS/SAN drive.

thecoop · May 20, 2026, 9:48am

Firstly, do you have anything else running on the nodes? Any other significant processes that could be using memory? Or have your workloads changed recently?

If there's nothing else running on the nodes, the next course of action is to reduce the Elasticsearch heap size, say to 12GB, and see what effect it has. Does it reduce or eliminate the OOMs? What effect does it have on performance? How about 8GB? What if you increase the memory on the nodes to 64GB? At this point, you need more data on what the system is doing to be able to judge the best course of action.
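The heap experiment suggested above comes down to simple arithmetic: whatever is not heap is all that remains for JVM native allocations (the `malloc` in the crash log), the OS, and the filesystem cache. Here is a minimal Python sketch of that budget, assuming the 32 GB per-node figure from the original post. `memory_budget` is a hypothetical helper for illustration, not an Elasticsearch API.

```python
def memory_budget(total_ram_gb: float, heap_gb: float) -> dict:
    """Split a node's physical RAM into JVM heap and everything else.

    "Everything else" has to cover JVM native memory, the OS itself and
    the filesystem cache that Lucene relies on.
    """
    if heap_gb <= 0 or heap_gb > total_ram_gb:
        raise ValueError("heap must be positive and fit in physical RAM")
    return {
        "heap_gb": heap_gb,
        "non_heap_gb": total_ram_gb - heap_gb,
        "heap_fraction": heap_gb / total_ram_gb,
    }


# Heap sizes mentioned in this thread, on a 32 GB node.
for heap in (16, 12, 8):
    budget = memory_budget(32, heap)
    print(
        f"heap={budget['heap_gb']} GB  "
        f"left for native/OS/cache={budget['non_heap_gb']} GB  "
        f"heap share={budget['heap_fraction']:.0%}"
    )
```

The Elasticsearch documentation advises setting the heap to no more than 50% of total memory, so 16 GB on a 32 GB node sits right at that limit. Trying 12 GB or 8 GB, as suggested, shows whether the extra non-heap headroom stops the native allocation failures.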

Shubham_Khodpe · May 20, 2026, 1:03pm

Currently, only the Elasticsearch service is running on each data node. On the master, Elasticsearch and Kibana are running.

Shubham_Khodpe · June 4, 2026, 9:50am

I am facing an intermittent issue in my production Elasticsearch cluster and would appreciate guidance on identifying the root cause and implementing a permanent solution.

Environment Details

Elasticsearch Version: 9.1.3
Operating System: Windows Server
Deployment Type: Multi-node Elasticsearch Cluster
Number of Nodes: 6
- 2 Master Nodes
- 4 Data Nodes
Hardware Configuration (All Nodes):
- 32 GB RAM
- 16 CPU Cores
JVM Heap:
- Initially configured with 12 GB heap
- Recently increased to 16 GB heap (-Xms16g -Xmx16g)
Elasticsearch is running in the background through Windows Task Scheduler.
Scheduler is configured to skip execution if the process is already running.

Issue Description

The cluster works normally for several hours or even days, but at random times the cluster health changes from GREEN to YELLOW or RED.

When this happens:

Unassigned shard count starts increasing automatically.
Sometimes primary shards become unassigned.
Grafana dashboards stop loading data because Elasticsearch indices become unavailable.
Elasticsearch services remain running on all nodes.
Server uptime remains unchanged.
No server reboot or Elasticsearch service restart occurs during the incident.

Observations

Cluster health output shows:

Multiple unassigned shards
Occasionally unassigned primary shards
Cluster status changes to RED

Some of the errors observed in Elasticsearch logs:

timed out while waiting to acquire shard lock
allocation_status[no_valid_shard_copy]
NoLongerPrimaryShardException

GC-related warnings observed:

[gc] overhead, spent 10s collecting in the last 11s
timer thread slept for 10s

After increasing heap from 12 GB to 16 GB, GC performance improved significantly, but the cluster still occasionally experiences shard allocation issues.
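To put a number on GC warnings like the ones quoted above, the share of wall-clock time spent collecting can be pulled straight from the log line. The sketch below is a minimal example that assumes the message wording shown in this post, and also accepts the bracketed form (`spent [10s] ... last [11s]`) that Elasticsearch logs typically use. `gc_overhead_fraction` is a hypothetical helper name.

```python
import re

# Seconds per unit for the duration suffixes handled here (an assumption:
# only ms, s and m are expected in these messages).
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}

_GC_OVERHEAD = re.compile(
    r"overhead, spent \[?(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\]? "
    r"collecting in the last \[?(?P<window>[\d.]+)(?P<window_unit>ms|s|m)\]?"
)


def gc_overhead_fraction(line: str):
    """Return the fraction of the window spent in GC, or None if no match."""
    match = _GC_OVERHEAD.search(line)
    if match is None:
        return None
    spent = float(match["spent"]) * _UNIT_SECONDS[match["spent_unit"]]
    window = float(match["window"]) * _UNIT_SECONDS[match["window_unit"]]
    if window <= 0:
        return None
    return spent / window


sample = "[gc] overhead, spent 10s collecting in the last 11s"
print(f"{gc_overhead_fraction(sample):.0%} of the window spent in GC")
```

For the line quoted in this post, that is 10 of 11 seconds, meaning the node was doing almost nothing but garbage collection during that window. Tracking this fraction over time for each node would show whether the RED/YELLOW incidents line up with GC stalls.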

I have also observed date parsing errors from application data:

failed to parse date field

However, I believe these are unrelated to the cluster health issue.

Current Cluster Status

Disk utilization on data nodes is between 50% and 70%.
No node appears to be running out of disk space.
All Elasticsearch nodes remain online.
No planned restarts or maintenance activities are being performed when the issue occurs.

Questions

What could cause primary and replica shards to become unassigned while all nodes remain online?
Can long GC pauses alone trigger shard allocation failures and RED cluster status?
Are there any known issues or recommendations for running Elasticsearch on Windows using Task Scheduler instead of Windows Services?
What is the best way to identify the exact root cause of these intermittent shard allocation failures?
Is there any recommended cluster setting, JVM tuning, shard allocation setting, or architecture change that can permanently prevent this issue?
Has anyone experienced similar behavior where the cluster randomly becomes RED/YELLOW without any server reboot or Elasticsearch service restart?

Any recommendations on troubleshooting steps, best practices, or permanent fixes would be greatly appreciated.

Thank you.
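On the question of identifying why a shard is unassigned, one starting point is the cluster allocation explain API (`_cluster/allocation/explain`), which reports the allocator's reasoning for a given shard. The sketch below only builds the HTTP request and does not send it. The base URL and index name are placeholders, `build_allocation_explain_request` is a hypothetical helper, and a real cluster will also need authentication and TLS settings that are not shown here.

```python
import json
import urllib.request


def build_allocation_explain_request(base_url, index=None, shard=None, primary=None):
    """Build (but do not send) a cluster allocation explain request.

    With no arguments, Elasticsearch explains an arbitrary unassigned shard.
    To target a specific shard, index, shard and primary must be given together.
    """
    given = [value is not None for value in (index, shard, primary)]
    if any(given) and not all(given):
        raise ValueError("index, shard and primary must be provided together")

    data = None
    if all(given):
        body = {"index": index, "shard": shard, "primary": primary}
        data = json.dumps(body).encode("utf-8")

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/allocation/explain",
        data=data,
        # The API accepts GET and POST; POST is used when a body is attached.
        method="POST" if data is not None else "GET",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and index name, for illustration only.
request = build_allocation_explain_request(
    "https://localhost:9200", index="logs-example", shard=0, primary=True
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Running the request during an incident, while shards are actually unassigned, is what makes the output useful. The explanation in the response can then be matched against log entries from the same moment, such as the shard lock timeouts.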
