We are seeing frequent Elasticsearch crashes on every node in our cluster and need guidance on identifying the root cause and on recommended tuning.
Environment Details
Elasticsearch Version: 9.1.3
OS: Windows Server
Cluster Type: Multi-node cluster
Each Server Configuration:
CPU: 16 Cores
RAM: 32 GB
JVM Heap Configuration:
-Xms16g
-Xmx16g
Issue
All Elasticsearch nodes stop at random, repeatedly, during operations such as indexing, shard relocation, and snapshot/archive activities.
We are seeing the following errors in logs:
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (malloc) failed to allocate bytes.
Chunk::new
Elasticsearch exited unexpectedly, with exit code 1
We also see related errors such as:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
Additional Observations
Memory utilization on all Elasticsearch nodes remains consistently high, around 95% to 98%.
First, is anything else running on the nodes? Are there other significant processes that could be using memory? Have your workloads changed recently?
If nothing else is running on the nodes, the next step is to reduce the Elasticsearch heap size, say to 12GB, and see what effect it has. The errors you quote are native allocation (malloc) failures, so the memory running short is outside the JVM heap, and a smaller heap leaves more room for it. Does it reduce or eliminate the OOMs? What effect does it have on performance? How about 8GB? What if you increase the memory on the nodes to 64GB? At this point, you need more data on what the system is doing before you can judge the best course of action.
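One way to gather that data is to record, per node, how much RAM is left outside the heap. The sketch below reads the nodes stats API (`GET _nodes/stats/jvm,os`) and reports heap size against total and free OS memory. The base URL, the absence of authentication, and the function names are assumptions for illustration; a secured cluster would need HTTPS and credentials.

```python
import json
import urllib.request

GIB = 2**30

def fetch_node_stats(base_url="http://localhost:9200"):
    """Fetch JVM and OS stats for all nodes (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/jvm,os") as resp:
        return json.load(resp)

def memory_headroom(stats):
    """Summarise, per node, how much RAM remains outside the JVM heap."""
    rows = []
    for node in stats["nodes"].values():
        heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
        os_mem = node["os"]["mem"]
        rows.append({
            "name": node["name"],
            "heap_max_gb": round(heap_max / GIB, 1),
            # RAM available to everything that is not heap: native
            # allocations, filesystem cache, the OS, other processes.
            "non_heap_ram_gb": round((os_mem["total_in_bytes"] - heap_max) / GIB, 1),
            "os_free_gb": round(os_mem["free_in_bytes"] / GIB, 1),
            "os_used_percent": os_mem["used_percent"],
        })
    return rows
```

Calling `memory_headroom(fetch_node_stats())` at intervals, before and after each heap change, gives a like-for-like comparison across the 16GB, 12GB, and 8GB settings.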
I am facing an intermittent issue in my production Elasticsearch cluster and would appreciate guidance on identifying the root cause and implementing a permanent solution.
Environment Details
Elasticsearch Version: 9.1.3
Operating System: Windows Server
Deployment Type: Multi-node Elasticsearch Cluster
Number of Nodes: 6
2 Master Nodes
4 Data Nodes
Hardware Configuration (All Nodes):
32 GB RAM
16 CPU Cores
JVM Heap:
Initially configured with 12 GB heap
Recently increased to 16 GB heap (-Xms16g -Xmx16g)
Elasticsearch is running in the background through Windows Task Scheduler.
Scheduler is configured to skip execution if the process is already running.
Issue Description
The cluster works normally for several hours or even days, but at random times the cluster health changes from GREEN to YELLOW or RED.
Grafana dashboards stop loading data because Elasticsearch indices become unavailable.
Elasticsearch services remain running on all nodes.
Server uptime remains unchanged.
No server reboot or Elasticsearch service restart occurs during the incident.
Observations
Cluster health output shows:
Multiple unassigned shards
Occasionally unassigned primary shards
Cluster status changes to RED
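To put numbers on the health output above, the unassigned shards can be grouped by kind and by the reason Elasticsearch records for them. This is a minimal sketch that assumes the input is the parsed JSON from `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`; the function name is hypothetical.

```python
from collections import Counter

def summarize_unassigned(shards):
    """Count unassigned shards by (primary/replica, unassigned reason).

    `shards` is the list of dicts returned by the cat shards API with
    format=json and the columns named in the lead-in.
    """
    counts = Counter()
    for shard in shards:
        if shard.get("state") == "UNASSIGNED":
            kind = "primary" if shard.get("prirep") == "p" else "replica"
            counts[(kind, shard.get("unassigned.reason"))] += 1
    return counts
```

Capturing this summary during an incident shows whether the shards were dropped because a node left the cluster or because allocation itself failed.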
Some of the errors observed in Elasticsearch logs:
timed out while waiting to acquire shard lock
allocation_status[no_valid_shard_copy]
NoLongerPrimaryShardException
GC-related warnings observed:
[gc] overhead, spent 10s collecting in the last 11s
timer thread slept for 10s
After increasing heap from 12 GB to 16 GB, GC performance improved significantly, but the cluster still occasionally experiences shard allocation issues.
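To check whether GC pressure lines up in time with the shard incidents, the overhead warnings can be turned into a ratio and tracked over time. The sketch below parses the warning in the form quoted above and also tolerates bracketed durations such as `[10s]`; the regular expression and the function name are assumptions for illustration, not an official log format definition.

```python
import re

# Seconds per unit for the duration suffixes handled here.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}

_GC_RE = re.compile(
    r"overhead, spent \[?([\d.]+)(ms|s|m)\]? "
    r"collecting in the last \[?([\d.]+)(ms|s|m)\]?"
)

def gc_overhead(line):
    """Return the fraction of the window spent in GC, or None if no match."""
    match = _GC_RE.search(line)
    if match is None:
        return None
    spent = float(match.group(1)) * _UNITS[match.group(2)]
    window = float(match.group(3)) * _UNITS[match.group(4)]
    return spent / window
```

For the warning quoted above, this gives 10/11, meaning the JVM spent roughly 91% of that window collecting garbage.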
I have also observed date parsing errors from application data:
failed to parse date field
However, I believe these are unrelated to the cluster health issue.
Current Cluster Status
Disk utilization on data nodes is between 50% and 70%.
No node appears to be running out of disk space.
All Elasticsearch nodes remain online.
No planned restarts or maintenance activities are being performed when the issue occurs.
Questions
What could cause primary and replica shards to become unassigned while all nodes remain online?
Can long GC pauses alone trigger shard allocation failures and RED cluster status?
Are there any known issues or recommendations for running Elasticsearch on Windows using Task Scheduler instead of Windows Services?
What is the best way to identify the exact root cause of these intermittent shard allocation failures?
Is there any recommended cluster setting, JVM tuning, shard allocation setting, or architecture change that can permanently prevent this issue?
Has anyone experienced similar behavior where the cluster randomly becomes RED/YELLOW without any server reboot or Elasticsearch service restart?
Any recommendations on troubleshooting steps, best practices, or permanent fixes would be greatly appreciated.
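For the root-cause question, the cluster allocation explain API (`_cluster/allocation/explain`) reports why a specific shard is unassigned. The sketch below builds that request for one shard and condenses the response into a few lines. The base URL, the lack of authentication, and the function names are assumptions for illustration.

```python
import json
import urllib.request

def explain_request(index, shard, primary, base_url="http://localhost:9200"):
    """Build an allocation explain request for one shard copy."""
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return urllib.request.Request(
        f"{base_url}/_cluster/allocation/explain",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def summarize_explanation(explanation):
    """Condense an allocation explain response into readable lines."""
    info = explanation.get("unassigned_info", {})
    lines = [
        f"reason={info.get('reason')} "
        f"status={info.get('last_allocation_status')}"
    ]
    for node in explanation.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            # Only report the deciders that block allocation on this node.
            if decider.get("decision") != "YES":
                lines.append(
                    f"{node.get('node_name')}: {decider.get('decider')}: "
                    f"{decider.get('explanation')}"
                )
    return lines
```

The request can be sent with `urllib.request.urlopen(explain_request("my-index", 0, True))`, where `my-index` is a placeholder, and the decoded JSON passed to `summarize_explanation`. Running it while the cluster is RED matters, because the explanation reflects the current state only.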