100% Windows memory usage but no OOM

Running ES 1.5.2 and Windows 2008 R2 Ent

One of my nodes became unresponsive, so I logged on, only to see 100% memory usage on the Windows server. I stopped the ES process and the memory was completely freed up. I checked the logs: there are no OOM exceptions, and no dump files either (a crash usually leaves a dump file).

The only error that shows up is an index throttling error, though that only started happening 2-3 hours into the bulking process. So bulking had started OK for a couple of hours and then started throttling.

So I can't explain the 100% usage.

Anything in the logs, GC, check hot_threads?

I'll post my logs somewhere asap...

Here are some stats from the node, taken while bulking is happening, until I can get the logs...

These are based on Windows Task Manager
Memory Private Set: 32GB (I assume this is the 30GB ES_HEAP_SIZE; doesn't seem to be growing.)
Working Set: 74GB (Seems to be growing)
Commit Size: 38GB
Paged Pool: 4.6GB
Non Paged Pool: 1GB
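The gap between those two Task Manager numbers is itself telling; a quick back-of-envelope check (Python, using the values above; a rough estimate, not exact accounting, since shared DLLs and other shared pages also land in the working set):

```python
# Values (in GB) from Windows Task Manager, as reported above.
private_set = 32   # roughly the 30GB ES_HEAP_SIZE plus JVM overhead
working_set = 74   # the number that keeps growing

# Pages counted in the working set but not in private bytes are shared --
# here that should be dominated by memory-mapped index files.
mapped_estimate = working_set - private_set
print(mapped_estimate)  # 42
```

Which lines up with RAMMap's "Mapped File (Active)" figure below.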

These are based on RAMMap
Mapped File (Total): 88GB
Mapped File (Active): 42GB
Process Private (Total): 38GB
Process Private (Active): 38GB
Paged Pool (Total): 4GB
Paged Pool (Active): 4GB
Non Paged Pool (Total): 3GB
Non Paged Pool (Active): 3GB

So I stopped bulking and walked away for an hour.

Working Set: 49GB
Mapped File (Total): 79GB
Mapped File (Active): 42GB

So I suppose it is possible that bulking used up all the RAM with mapped files?

Here are the logs...

Server was started on 2015-06-06
100% memory may have happened around 2015-06-06 21:00
Problem detected on 2015-06-07 10:30 and ES process was stopped and restarted. (Any logs after that moment are "irrelevant")

So I think it's the mapped file. Another node is at 110GB of my 128GB.

Memory (Private Working Set): 33GB
Memory Working Set: 108GB

The docs say leave half for ES and the rest for mapped files.
I have 128GB per node, and ES is configured with ES_HEAP_SIZE=30gb

I guess the OS will try to cache as much as it can regardless of how much heap is configured for ES/Lucene?
I also assume that mlockall on Linux would be beneficial here. I saw there's also experimentation going on to get mlockall working on Windows.
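For reference, on Linux the relevant Elasticsearch 1.x setting is `bootstrap.mlockall` (a minimal sketch; it only pins the JVM heap, it does not limit the OS file cache, and in 1.5 it has no effect on Windows):

```yaml
# elasticsearch.yml (Elasticsearch 1.x) -- lock the heap into RAM on Linux.
# Requires the ES user to be allowed to lock memory (e.g. ulimit -l unlimited).
bootstrap.mlockall: true
```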

I'm curious: is there a setting in Windows that can limit the total memory used for mapped files?

How big are your bulk requests?

2,500 to 3,000 docs per bulk request, one request every 2-3 seconds, at 3,000 bytes average per doc. So about 9MB per 2-3 seconds, let's say. It takes ES about 3 seconds to bulk index 3,000 docs. Basically I have a JMeter script that generates data and sends it to a vertx.io application, which stores each request in a map and then bulks that to ES using TransportClient and BulkRequest.
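Those numbers work out as follows (a back-of-envelope sketch, taking 3,000 docs and 3 seconds per bulk):

```python
# Rough bulk-indexing throughput from the figures above.
docs_per_bulk = 3000
avg_doc_bytes = 3000
seconds_per_bulk = 3

bulk_mb = docs_per_bulk * avg_doc_bytes / 1e6   # payload per bulk request, MB
docs_per_sec = docs_per_bulk / seconds_per_bulk # sustained docs/second
mb_per_sec = bulk_mb / seconds_per_bulk         # sustained MB/second
print(bulk_mb, docs_per_sec, mb_per_sec)  # 9.0 1000.0 3.0
```

So roughly 1,000 docs/s at about 3 MB/s, which is a fairly modest load for a node with 30GB of heap.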

It happened again just now for another machine.
Working set was 123GB
And Private Set still at 33GB.

But just a few seconds after I logged in, it freed itself.

Now the Working Set is at 103GB and seems to be dropping slowly...

There are no OOM exceptions; the node got dropped from the cluster since it was unresponsive, but once the RAM in Windows cleared, the process rejoined the cluster. I didn't even have to restart it. Currently running ES in a command-line window.

The only exception is NodeNotConnectedException, plus the standard GC message for a young GC taking about 800ms.

And of course the dreaded recovery slowly taking its time :).

Not quite sure which of the two it is, but now that I'm running ES as a Windows Service, the Working Set has not gone above 60GB. That, or it could have been that I had disabled the Windows swap file?

@javadevmtl - I'm experiencing a similar issue; did you determine what the issue was, and how you resolved it? I'm already running ES as a Windows Service, but am hesitant to disable the swap file

Unfortunately no. I moved my nodes to Linux.

The large working set should not be an issue. Elasticsearch uses mmap to read index files, but this should only eat address space, not physical memory.
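To illustrate the point (a small Python sketch, not Elasticsearch code): pages of a memory-mapped file are loaded lazily by the OS and belong to the file cache, so they show up in the process's working set but remain reclaimable under memory pressure, unlike private heap.

```python
# Minimal mmap illustration: mapping reserves address space; pages are
# faulted in on access and charged to the OS file cache, not to the
# process's private bytes.
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "segment.bin")
with open(path, "wb") as f:
    f.write(b"x" * (1 << 20))  # a 1 MiB stand-in for an index file

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mapped[:4]  # touching the data faults the page in
    mapped.close()

print(first)  # b'xxxx'
```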

Does changing the store type to simplefs make things any better? https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html#index-modules-store
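If you want to try that, the setting in 1.x looks roughly like this (sketch; it can also be set per index at creation time, and only affects newly created shards):

```yaml
# elasticsearch.yml (Elasticsearch 1.x) -- default store type for new indices.
# simplefs/niofs avoid mmap entirely, at some read-performance cost.
index.store.type: simplefs
```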

@javadevmtl - thanks for the reply
@jpountz - thank you for the link; some interesting items to tune (haven't tried yet; see below)

Our 100% Windows memory usage turned out to be a symptom, not the actual issue. The real issue was a runaway query (a regexp filter with a `.*something.*` pattern), which then caused indexing performance issues. Memory may still be a factor, but needs more investigation.
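For anyone hitting the same thing, the problem query looked roughly like this (the field name `message` is hypothetical, and the pattern is a guess at the post's `.something.`, whose asterisks were likely eaten by the forum's formatting):

```json
{
  "query": {
    "filtered": {
      "filter": {
        "regexp": {
          "message": ".*something.*"
        }
      }
    }
  }
}
```

The leading `.*` is what makes it runaway: the automaton can no longer skip ahead in the terms dictionary and has to be evaluated against every term in the field.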