One of my nodes became unresponsive, so I logged on only to see 100% memory usage on the Windows server. I stopped the ES process and the memory was completely freed up. I checked the logs: there are no OOM exceptions and no dump files (sometimes with a crash you get a dump file).
The only error that shows up is an index throttling error, though that only started happening 2-3 hours into the bulk indexing process. So bulking had started OK for a couple of hours and then started throttling.
Here are some stats for the node that is currently running while bulking is happening, until I can get the logs...
These are based on Windows Task Manager
Memory Private Set: 32GB (I assume this is the 30GB ES_HEAP_SIZE; it doesn't seem to be growing.)
Working Set: 74GB (Seems to be growing)
Commit Size: 38GB
Paged Pool: 4.6GB
Non Paged Pool: 1GB
These are based on RAMMap
Mapped File (Total): 88GB
Mapped File (Active): 42GB
Process Private (Total): 38GB
Process Private (Active): 38GB
Paged Pool (Total): 4GB
Paged Pool (Active): 4GB
Non Paged Pool (Total): 3GB
Non Paged Pool (Active): 3GB
Server was started on 2015-06-06
100% memory may have happened around 2015-06-06 21:00
Problem detected on 2015-06-07 10:30, and the ES process was stopped and restarted. (Any logs after that moment are "irrelevant".)
So I think it's the mapped files. Another node is at 110GB of its 128GB.
Memory (Private Working Set): 33GB
Memory Working Set: 108GB
The docs say to leave half the memory for ES and the rest for mapped files.
I have 128GB per node, and ES is configured with ES_HEAP_SIZE=30gb.
I guess the OS will try to cache as much as it can regardless of how much heap is configured for ES/Lucene?
I also assume that mlockall on Linux would be beneficial here. I saw there's also experimentation going on to get mlockall working on Windows.
I'm curious: is there a setting in Windows that can limit the total memory used for mapped files?
I send 2,500 to 3,000 docs per bulk request every 2-3 seconds, at 3,000 bytes average per doc, so roughly 9MB every 2-3 seconds. It takes ES about 3 seconds to bulk index 3,000 docs. Basically I have a JMeter script that generates data and sends it to a vert.x application that stores each request in a map and then bulks it to ES using TransportClient and BulkRequest, roughly as sketched below.
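For context, the flushing side of that vert.x application looks roughly like this. It's only a minimal sketch against the 1.x Java API; the cluster name, host, index, and type names are placeholders, not the real ones.

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

import java.util.Map;

public class BulkIndexer {
    private final Client client;

    public BulkIndexer(String host) {
        // Placeholder cluster name and host; the real setup runs on the nodes described above.
        client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")
                .build())
                .addTransportAddress(new InetSocketTransportAddress(host, 9300));
    }

    /** Flushes the buffered documents (id -> JSON source) as one bulk request. */
    public void flush(Map<String, String> buffered) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Map.Entry<String, String> e : buffered.entrySet()) {
            bulk.add(client.prepareIndex("my-index", "my-type", e.getKey())
                           .setSource(e.getValue()));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
        buffered.clear();
    }
}
```

The map of buffered requests mirrors what I described; BulkProcessor could also be used to handle the batching and flushing automatically instead of a hand-rolled map.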
It happened again just now for another machine.
Working set was 123GB
And Private Set still at 33GB.
But just a few seconds after I logged in, it freed itself.
Now the Working Set is at 103GB and seems to be dropping slowly...
There are no OOM exceptions. The node got dropped from the cluster since it was unresponsive, but once the RAM in Windows cleared, the process rejoined the cluster. I didn't even have to restart it. Currently running ES in a command-line window.
The only exception is NodeNotConnectedException, and there's the standard GC message for a young GC taking about 800ms.
And of course the dreaded recovery is slowly taking its time :).
Not quite sure which of the two it is, but now that I'm running ES as a Windows service, the Working Set has not gone above 60GB. That, or it could be because I had disabled the Windows swap file?
@javadevmtl - I'm experiencing a similar issue; did you determine what the issue was, and how you resolved it? I'm already running ES as a Windows service, but am hesitant to disable the swap file.
The large working set should not be an issue. Elasticsearch uses mmap to read index files, but this should only eat address space, not physical memory.
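To illustrate the distinction, here is a generic Java sketch of a memory-mapped read (this is not Elasticsearch's own code; Lucene's MMapDirectory does something similar for index files):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        // Map a file read-only. The mapping itself only reserves virtual address space;
        // physical pages are pulled in (and counted in the working set / file cache)
        // as the buffer is actually read, and the OS can reclaim them under pressure.
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r");
             FileChannel channel = file.getChannel()) {
            long size = Math.min(channel.size(), Integer.MAX_VALUE); // a single mapping is capped at 2GB
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get(); // touching pages faults them in on demand
            }
            System.out.println("checksum: " + sum);
        }
    }
}
```

That's why the working set can balloon with mapped index files without the process actually running out of memory: the pages are backed by the files and can be dropped by the OS when something else needs RAM.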
@javadevmtl - thanks for the reply.
@jpountz - thank you for the link; some interesting items to tune (haven't tried yet; see below).
Our 100% Windows memory usage turned out to be a symptom, not the actual issue. The real issue was a runaway query (a regexp filter with a .*something.* pattern), which then caused indexing performance issues. Memory may still be a factor, but it needs more investigation.
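For anyone else who hits this: a regexp anchored on neither end (wildcards on both sides) forces a scan over a large part of the terms dictionary, which is what made it "run away". A hypothetical sketch of that kind of query through the 1.x Java API (index and field names are made up):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class RunawayQueryExample {
    // Illustrative only: a regexp filter with wildcards on both ends is the
    // expensive shape of query described above.
    public static SearchResponse slowSearch(Client client) {
        return client.prepareSearch("my-index")                 // hypothetical index name
                .setQuery(QueryBuilders.filteredQuery(
                        QueryBuilders.matchAllQuery(),
                        FilterBuilders.regexpFilter("message",  // hypothetical field name
                                ".*something.*")))
                .execute()
                .actionGet();
    }
}
```

The usual fix is to anchor the pattern (prefix-style) or index with ngrams so the substring match becomes a cheap term lookup.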