Hi Team,
We are seeing frequent Elasticsearch crashes on every node in our cluster and need guidance on identifying the root cause and on recommended tuning.
Environment Details
- Elasticsearch Version: 9.1.3
- OS: Windows Server
- Cluster Type: Multi-node cluster
- Each Server Configuration:
  - CPU: 16 Cores
  - RAM: 32 GB
  - JVM Heap Configuration:
    - Xms = 16g
    - Xmx = 16g
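To make the memory split on each node explicit, here is a rough sketch of the arithmetic. The function name is our own, and treating "everything outside the heap" as a single bucket (JVM native memory, the OS, and filesystem cache together) is a simplification on our part.

```python
def memory_budget(total_ram_gb: float, heap_gb: float) -> dict:
    """Split a node's RAM into JVM heap and everything outside the heap."""
    if heap_gb > total_ram_gb:
        raise ValueError("heap cannot exceed total RAM")
    return {
        "heap_gb": heap_gb,
        # Shared by JVM native allocations, the OS, and filesystem cache.
        "non_heap_gb": total_ram_gb - heap_gb,
        "heap_fraction": heap_gb / total_ram_gb,
    }


# Values taken from the environment details above.
budget = memory_budget(total_ram_gb=32, heap_gb=16)
print(budget)
```

With our settings the heap takes exactly half of the RAM, which leaves 16 GB for everything else on the server.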
Issue
All Elasticsearch nodes stop unexpectedly and repeatedly during operations such as indexing, shard relocation, and snapshot/archive activities.
We are seeing the following errors in the logs:
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (malloc) failed to allocate bytes.
Chunk::new
Elasticsearch exited unexpectedly, with exit code 1
We also see related errors such as:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
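To check how often each of these errors occurs and whether they cluster together, we could count them across the log files. This is only a minimal sketch: the signature strings are copied from the errors quoted above, while the labels, the function name, and the sample lines are our own illustrative choices.

```python
from collections import Counter

# Substrings copied from the errors quoted above; the keys are our own labels.
SIGNATURES = {
    "native_oom": "There is insufficient memory for the Java Runtime Environment",
    "malloc_failed": "Native memory allocation (malloc) failed",
    "unexpected_exit": "Elasticsearch exited unexpectedly",
    "index_writer_closed": "AlreadyClosedException: this IndexWriter is closed",
}


def count_signatures(lines):
    """Count how many lines contain each known error signature."""
    counts = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


# Illustrative sample only; in practice, pass an open log file instead, e.g.
#   with open(log_path, errors="replace") as f: count_signatures(f)
sample_lines = [
    "There is insufficient memory for the Java Runtime Environment to continue.",
    "Elasticsearch exited unexpectedly, with exit code 1",
    "org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed",
    "some unrelated log line",
]
counts = count_signatures(sample_lines)
print(dict(counts))
```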
Additional Observations
- Memory utilization on all Elasticsearch nodes remains consistently high, around 95% to 98%.
- The cluster frequently performs shard relocation/recovery.
- We are also using snapshots/archive storage.
- Heavy ingestion is happening through Logstash.
- Grafana dashboards are connected to Elasticsearch.
- The issue is occurring on all nodes.
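To separate JVM heap pressure from overall OS memory pressure on each node, we could compare the two figures reported by the node stats API (`GET /_nodes/stats/jvm,os`). The sketch below assumes an unauthenticated HTTP endpoint, which will not match a secured cluster, and the sample response is hand-written with made-up values purely to show the shape of the data.

```python
import json
from urllib.request import urlopen


def fetch_node_stats(base_url):
    """Fetch JVM and OS stats for all nodes.

    base_url is an assumption (for example "http://localhost:9200");
    authentication and TLS are deliberately left out of this sketch.
    """
    with urlopen(f"{base_url}/_nodes/stats/jvm,os") as resp:
        return json.load(resp)


def summarize(stats):
    """Return heap and OS memory usage percentages per node, sorted by name."""
    rows = []
    for node in stats["nodes"].values():
        rows.append({
            "name": node["name"],
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "os_mem_used_percent": node["os"]["mem"]["used_percent"],
        })
    return sorted(rows, key=lambda row: row["name"])


# Hand-written sample in the shape of the API response; values are made up.
sample_stats = {
    "nodes": {
        "id-b": {"name": "node-2",
                 "jvm": {"mem": {"heap_used_percent": 71}},
                 "os": {"mem": {"used_percent": 97}}},
        "id-a": {"name": "node-1",
                 "jvm": {"mem": {"heap_used_percent": 64}},
                 "os": {"mem": {"used_percent": 96}}},
    }
}
rows = summarize(sample_stats)
for row in rows:
    print(row)
```

If heap usage stays moderate while OS memory usage sits near the limit, that would point toward memory consumed outside the heap rather than the heap itself.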
Questions
- Is a 16 GB heap too large for a 32 GB Windows server running Elasticsearch?
- Could this be caused by native/off-heap memory exhaustion?
- Are there any recommended JVM settings for Windows environments?
- Could shard count or shard recovery activity be causing these crashes?
- What is the recommended heap size and tuning for our server configuration?
- Does consistently high memory utilization (95–98%) indicate improper heap sizing or insufficient OS/native memory availability?
Any suggestions or best practices would be very helpful.
Thanks.
