Elasticsearch Nodes Randomly Crashing Due to JVM Native Memory Allocation Failure on Windows Servers

Hi Team,

We are facing frequent Elasticsearch node crashes across all nodes in our cluster and need guidance on identifying the root cause and recommended tuning.

Environment Details

  • Elasticsearch Version: 9.1.3
  • OS: Windows Server
  • Cluster Type: Multi-node cluster
  • Each Server Configuration:
    • CPU: 16 Cores
    • RAM: 32 GB
  • JVM Heap Configuration:
    • Xms = 16g
    • Xmx = 16g

Issue

All Elasticsearch nodes stop at random, often several times, during operations such as indexing, shard relocation, and snapshot/archive activities.

We are seeing the following errors in logs:

There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (malloc) failed to allocate bytes.
Chunk::new

Elasticsearch exited unexpectedly, with exit code 1

We also see related errors such as:

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
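
To separate native allocation failures from ordinary heap exhaustion when going through the logs, a small classifier along these lines may help. It is only a sketch: the category names and the `classify` helper are made up for illustration, the first two patterns come from the messages quoted above, and the heap pattern is the standard `java.lang.OutOfMemoryError: Java heap space` text.

```python
import re

# Category names are illustrative; patterns match the messages quoted in this post
# plus the standard JVM heap-exhaustion message for comparison.
PATTERNS = {
    "native_oom": re.compile(r"Native memory allocation \((malloc|mmap)\) failed"),
    "jvm_insufficient_memory": re.compile(
        r"insufficient memory for the Java Runtime Environment"
    ),
    "heap_oom": re.compile(r"java\.lang\.OutOfMemoryError: Java heap space"),
    "already_closed": re.compile(r"AlreadyClosedException"),
}


def classify(lines):
    """Count how many lines match each category."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "There is insufficient memory for the Java Runtime Environment to continue.",
        "Native memory allocation (malloc) failed to allocate bytes.",
        "org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed",
    ]
    print(classify(sample))
```

If `native_oom` is non-zero while `heap_oom` stays at zero, the process is most likely running out of memory outside the Java heap rather than inside it.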

Additional Observations

  • Memory utilization on all Elasticsearch nodes remains consistently high, around 95% to 98%.
  • Cluster frequently performs shard relocation/recovery.
  • We are also using snapshots/archive storage.
  • Heavy ingestion is happening through Logstash.
  • Grafana dashboards are connected to Elasticsearch.
  • The issue is occurring on all nodes.

Questions

  1. Is 16 GB heap too high for a 32 GB Windows server in Elasticsearch?
  2. Could this be caused by native/off-heap memory exhaustion?
  3. Are there any recommended JVM settings for Windows environments?
  4. Could shard count or shard recovery activity be causing these crashes?
  5. What is the recommended heap size and tuning for our server configuration?
  6. Does consistently high memory utilization (95–98%) indicate improper heap sizing or insufficient OS/native memory availability?
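
For reference on questions 1, 5 and 6, the arithmetic can be sketched as below. Elastic's documentation advises setting Xms and Xmx to no more than 50% of total memory, so the check in the code treats that as an upper bound, not a target. The `memory_budget` helper and its parameter names are hypothetical, and the figure it reports is simply what is left for JVM native allocations, the operating system, and the filesystem cache combined.

```python
def memory_budget(total_ram_gib, heap_gib, other_processes_gib=0.0):
    """Return how much RAM remains outside the Java heap.

    other_processes_gib is an assumed allowance for anything else on the host
    (for example Kibana on the master nodes).
    """
    non_heap_gib = total_ram_gib - heap_gib - other_processes_gib
    return {
        "heap_fraction": heap_gib / total_ram_gib,
        "non_heap_gib": non_heap_gib,
        # Documented upper bound: heap should not exceed half of total memory.
        "heap_within_half_of_ram": heap_gib <= total_ram_gib / 2,
    }


if __name__ == "__main__":
    # 32 GB servers from this post, with the heap sizes under discussion.
    for heap in (16, 12, 8):
        budget = memory_budget(32, heap)
        print(heap, "GiB heap ->", budget["non_heap_gib"], "GiB left outside the heap")
```

With a 16 GB heap on a 32 GB server the heap sits exactly at the documented limit, which leaves 16 GB for everything else on the machine.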

Any suggestions or best practices would be very helpful.

Thanks.

Is this the same cluster discussed in your previous thread, "Elasticsearch data node instability and indexing failures when using NAS (shared storage) with separate folders per node"?

In that thread, you had:

We are observing the following issues:

  • Bulk indexing failures (HTTP 500 errors)
  • Errors such as:
    • AlreadyClosedException: this ReferenceManager is closed
    • RemoteTransportException
    • UnavailableShardsException
  • Shards becoming unavailable intermittently
  • Cluster not recovering properly at times
  • Locking Issue
  • Grafana dashboards taking a long time to load data

If it is the same cluster, how did you address the storage issues?

It is the same cluster, but the data is currently stored on the local disk of each server, not on a NAS/SAN drive.

Firstly, do you have anything else running on the nodes? Any other significant processes that could be using memory? Or have your workloads changed recently?

If there's nothing else running on the nodes, the next course of action is to reduce the Elasticsearch heap size, say to 12 GB, and see what effect it has. Does it reduce or eliminate the OOMs? What effect does it have on performance? How about 8 GB? What if you increase the memory on the nodes to 64 GB? At this point, you need more data on what the system is doing to be able to judge the best course of action.

Currently, only a single Elasticsearch service is running on each data node. On the master nodes, Elasticsearch and Kibana are running.
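
One way to collect that data is to record the output of `GET _nodes/stats/jvm,os` at each heap setting and compare heap usage with operating system memory per node. The sketch below only summarizes an already parsed response; the `summarize_node_memory` name is hypothetical and the sample values are invented for illustration. Note that `non_heap_used_in_bytes` covers JVM-managed non-heap pools only and does not capture every native allocation.

```python
GIB = 1024 ** 3


def summarize_node_memory(nodes_stats):
    """Summarize heap and OS memory per node from a parsed _nodes/stats response."""
    rows = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        jvm_mem = node["jvm"]["mem"]
        os_mem = node["os"]["mem"]
        rows.append({
            "name": node.get("name", node_id),
            "heap_used_pct": round(
                100 * jvm_mem["heap_used_in_bytes"] / jvm_mem["heap_max_in_bytes"], 1
            ),
            "jvm_non_heap_gib": round(jvm_mem["non_heap_used_in_bytes"] / GIB, 2),
            "os_used_pct": round(
                100 * os_mem["used_in_bytes"] / os_mem["total_in_bytes"], 1
            ),
            "os_free_gib": round(os_mem["free_in_bytes"] / GIB, 2),
        })
    return rows


if __name__ == "__main__":
    # Invented sample shaped like a _nodes/stats/jvm,os response.
    sample = {
        "nodes": {
            "abc123": {
                "name": "data-1",
                "jvm": {"mem": {
                    "heap_used_in_bytes": 8 * GIB,
                    "heap_max_in_bytes": 16 * GIB,
                    "non_heap_used_in_bytes": GIB // 2,
                }},
                "os": {"mem": {
                    "total_in_bytes": 32 * GIB,
                    "used_in_bytes": 31 * GIB,
                    "free_in_bytes": 1 * GIB,
                }},
            }
        }
    }
    for row in summarize_node_memory(sample):
        print(row)
```

A node whose heap is half empty while the OS reports almost no free memory points at pressure outside the heap, which is the pattern worth watching for as the heap size is changed.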

I am facing an intermittent issue in my production Elasticsearch cluster and would appreciate guidance on identifying the root cause and implementing a permanent solution.

Environment Details

  • Elasticsearch Version: 9.1.3
  • Operating System: Windows Server
  • Deployment Type: Multi-node Elasticsearch Cluster
  • Number of Nodes: 6
    • 2 Master Nodes
    • 4 Data Nodes
  • Hardware Configuration (All Nodes):
    • 32 GB RAM
    • 16 CPU Cores
  • JVM Heap:
    • Initially configured with 12 GB heap
    • Recently increased to 16 GB heap (-Xms16g -Xmx16g)
  • Elasticsearch is running in the background through Windows Task Scheduler.
  • Scheduler is configured to skip execution if the process is already running.

Issue Description

The cluster works normally for several hours or even days, but at random times the cluster health changes from GREEN to YELLOW or RED.

When this happens:

  • Unassigned shard count starts increasing automatically.
  • Sometimes primary shards become unassigned.
  • Grafana dashboards stop loading data because Elasticsearch indices become unavailable.
  • Elasticsearch services remain running on all nodes.
  • Server uptime remains unchanged.
  • No server reboot or Elasticsearch service restart occurs during the incident.

Observations

Cluster health output shows:

  • Multiple unassigned shards
  • Occasionally unassigned primary shards
  • Cluster status changes to RED

Some of the errors observed in Elasticsearch logs:

timed out while waiting to acquire shard lock
allocation_status[no_valid_shard_copy]
NoLongerPrimaryShardException

GC-related warnings observed:

[gc] overhead, spent 10s collecting in the last 11s
timer thread slept for 10s

After increasing heap from 12 GB to 16 GB, GC performance improved significantly, but the cluster still occasionally experiences shard allocation issues.
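
To compare GC pressure before and after the heap change, the overhead warnings can be turned into a ratio. This is a rough sketch: the `gc_overhead_ratio` helper is hypothetical, and the regular expression assumes the wording quoted above, accepting the durations with or without square brackets.

```python
import re

# A duration such as 10s, 750ms or 1.2m, optionally wrapped in square brackets.
_DURATION = r"\[?(\d+(?:\.\d+)?)(ms|s|m)\]?"
_GC_OVERHEAD = re.compile(
    r"overhead, spent " + _DURATION + r" collecting in the last " + _DURATION
)
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead_ratio(line):
    """Return the fraction of the window spent in GC, or None if the line does not match."""
    match = _GC_OVERHEAD.search(line)
    if match is None:
        return None
    spent = float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    window = float(match.group(3)) * _UNIT_SECONDS[match.group(4)]
    if window == 0:
        return None
    return spent / window


if __name__ == "__main__":
    line = "[gc] overhead, spent 10s collecting in the last 11s"
    print(round(gc_overhead_ratio(line), 2))
```

For the warning quoted above, the ratio is 10/11, meaning the node spent about 91% of that window collecting garbage instead of doing useful work.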

I have also observed date parsing errors from application data:

failed to parse date field

However, I believe these are unrelated to the cluster health issue.

Current Cluster Status

  • Disk utilization on data nodes is between 50% and 70%.
  • No node appears to be running out of disk space.
  • All Elasticsearch nodes remain online.
  • No planned restarts or maintenance activities are being performed when the issue occurs.

Questions

  1. What could cause primary and replica shards to become unassigned while all nodes remain online?
  2. Can long GC pauses alone trigger shard allocation failures and RED cluster status?
  3. Are there any known issues or recommendations for running Elasticsearch on Windows using Task Scheduler instead of Windows Services?
  4. What is the best way to identify the exact root cause of these intermittent shard allocation failures?
  5. Is there any recommended cluster setting, JVM tuning, shard allocation setting, or architecture change that can permanently prevent this issue?
  6. Has anyone experienced similar behavior where the cluster randomly becomes RED/YELLOW without any server reboot or Elasticsearch service restart?
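
On question 4, the cluster allocation explain API (`GET _cluster/allocation/explain`) reports why a given shard is unassigned. The sketch below condenses a parsed response into the unassigned reason and the deciders that answered NO. The `summarize_allocation_explain` helper is hypothetical, and the sample response is invented and trimmed to the fields used here.

```python
def summarize_allocation_explain(explain):
    """Condense a parsed allocation explain response to the parts that block allocation."""
    kind = "primary" if explain.get("primary") else "replica"
    shard = "{}[{}] ({})".format(explain.get("index"), explain.get("shard"), kind)
    blockers = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blockers.append(
                    (node.get("node_name"), decider.get("decider"), decider.get("explanation"))
                )
    return {
        "shard": shard,
        "can_allocate": explain.get("can_allocate"),
        "unassigned_reason": explain.get("unassigned_info", {}).get("reason"),
        "blockers": blockers,
    }


if __name__ == "__main__":
    # Invented sample; index and node names are placeholders.
    sample = {
        "index": "logs-example",
        "shard": 0,
        "primary": False,
        "current_state": "unassigned",
        "unassigned_info": {"reason": "NODE_LEFT"},
        "can_allocate": "no",
        "node_allocation_decisions": [
            {
                "node_name": "data-1",
                "node_decision": "no",
                "deciders": [
                    {
                        "decider": "same_shard",
                        "decision": "NO",
                        "explanation": "a copy of this shard is already allocated to this node",
                    }
                ],
            }
        ],
    }
    print(summarize_allocation_explain(sample))
```

Capturing this output at the moment the cluster turns YELLOW or RED, together with the master node's log for the same time window, should show whether the shards became unassigned because a node was removed from the cluster or because of a failure on the shard itself.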

Any recommendations on troubleshooting steps, best practices, or permanent fixes would be greatly appreciated.

Thank you.