Elasticsearch Nodes Randomly Crashing Due to JVM Native Memory Allocation Failure on Windows Servers

Hi Team,

We are facing frequent Elasticsearch node crashes across all nodes in our cluster and need guidance on identifying the root cause and recommended tuning.

Environment Details

  • Elasticsearch Version: 9.1.3
  • OS: Windows Server
  • Cluster Type: Multi-node cluster
  • Each Server Configuration:
    • CPU: 16 Cores
    • RAM: 32 GB
  • JVM Heap Configuration:
    • Xms = 16g
    • Xmx = 16g

Issue

All Elasticsearch nodes stop unexpectedly, often several times, during operations such as indexing, shard relocation, and snapshot/archive activities.

We are seeing the following errors in logs:

There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (malloc) failed to allocate bytes.
Chunk::new

Elasticsearch exited unexpectedly, with exit code 1

We also see related errors such as:

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

Additional Observations

  • Memory utilization on all Elasticsearch nodes remains consistently high, around 95% to 98%.
  • The cluster frequently performs shard relocation/recovery.
  • We are also using snapshots/archive storage.
  • Heavy ingestion is happening through Logstash.
  • Grafana dashboards are connected to Elasticsearch.
  • The issue is occurring on all nodes.

Questions

  1. Is 16 GB heap too high for a 32 GB Windows server in Elasticsearch?
  2. Could this be caused by native/off-heap memory exhaustion?
  3. Are there any recommended JVM settings for Windows environments?
  4. Could shard count or shard recovery activity be causing these crashes?
  5. What is the recommended heap size and tuning for our server configuration?
  6. Does consistently high memory utilization (95–98%) indicate improper heap sizing or insufficient OS/native memory availability?

Any suggestions or best practices would be very helpful.

Thanks.

Is this the same cluster discussed in your previous thread, "Elasticsearch data node instability and indexing failures when using NAS (shared storage) with separate folders per node"?

In that thread, you had:

We are observing the following issues:

  • Bulk indexing failures (HTTP 500 errors)
  • Errors such as:
    • AlreadyClosedException: this ReferenceManager is closed
    • RemoteTransportException
    • UnavailableShardsException
  • Shards becoming unavailable intermittently
  • Cluster not recovering properly at times
  • Locking Issue
  • Grafana dashboards taking a long time to load data

If it is the same cluster, how did you address the storage issues?

It is the same cluster, but the data is currently stored on the local disk of each server and not on a NAS/SAN drive.

Firstly, do you have anything else running on the nodes? Any other significant processes that could be using memory? Or have your workloads changed recently?

If there's nothing else running on the nodes, the next course of action is to reduce the Elasticsearch heap size, say to 12GB, and see what effect it has. Does it reduce or eliminate the OOMs? What effect does it have on performance? How about 8GB? What if you increase the memory on the nodes to 64GB? At this point, you need more data on what the system is doing to be able to judge the best course of action.
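
To make the trade-off in those experiments concrete, here is a rough Python sketch of the arithmetic. It only tabulates how much RAM is left outside the JVM heap for each setting mentioned above; that remainder has to cover the OS, the filesystem cache, and the JVM's own native allocations. The function names are hypothetical helpers, not part of Elasticsearch, and the 64 GB row assumes the heap stays at 16 GB, which was not stated in the thread.

```python
# Hypothetical helpers (not an Elasticsearch API): plain arithmetic on the
# heap sizes discussed in this thread.

def memory_outside_heap(total_ram_gb, heap_gb):
    """Return (remaining_gb, heap_fraction) for one heap setting."""
    if heap_gb <= 0 or heap_gb > total_ram_gb:
        raise ValueError("heap must be positive and no larger than total RAM")
    return total_ram_gb - heap_gb, heap_gb / total_ram_gb


def heap_experiments(candidates):
    """Build one summary row per (total_ram_gb, heap_gb) pair."""
    rows = []
    for total_ram_gb, heap_gb in candidates:
        remaining_gb, fraction = memory_outside_heap(total_ram_gb, heap_gb)
        rows.append({
            "ram_gb": total_ram_gb,
            "heap_gb": heap_gb,
            "remaining_gb": remaining_gb,
            "heap_fraction": fraction,
        })
    return rows


# (total RAM, heap) pairs: the current setup, the two smaller heaps suggested
# above, and a 64 GB node. The 16 GB heap on the 64 GB node is an assumption.
CANDIDATES = [(32, 16), (32, 12), (32, 8), (64, 16)]

if __name__ == "__main__":
    for row in heap_experiments(CANDIDATES):
        print(
            f"{row['ram_gb']} GB RAM, {row['heap_gb']} GB heap: "
            f"{row['remaining_gb']} GB outside the heap "
            f"({row['heap_fraction']:.0%} of RAM is heap)"
        )
```

The current 16 GB heap sits right at the 50% mark that the Elasticsearch heap sizing guidance treats as an upper bound, so every smaller heap in the list leaves more room for native allocations. Whether that is enough to stop the crashes is something only the experiment itself can show.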