ES 5.6 gobbles up entire OS RAM after 30-40 mins of bulk indexing and no searching

Dear All,
We are using Elasticsearch in a production environment on 6 servers, each of which has 80 vCPUs and 286GB of RAM and runs 2 ES 5.6 data nodes.

Each data node is given just under 32GB of heap so that compressed oops stay enabled.
Spinning speed is set to HDD.
Async flush is set.
Translog size is set to 1GB.
Refresh interval is set to 30s.
mlock is true.
Swappiness is off.
File descriptor count is above 200,000.
Mmap count is 2612445.

One client application does bulk insertion into ES. It runs on the same server as the data nodes and uses 16GB of RAM.

The system is under bulk indexing load from 6 clients, each using a batch size of 7500 with 2 concurrent actions. The system is rarely used for searching (about 5-10 aggregation queries once an hour, at the hour rollover).

For the first 30-40 mins under load, ES works properly (i.e. the nodes take in huge loads and run smoothly).
top and free -g show that the system is healthy, and my client is also able to keep up the load.
After 40-45 mins, free -g on each server shows 0 under free and the entire available memory under shared/cache.
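
To tell memory that is truly free from memory that is only held as reclaimable cache, the figures can be read directly from /proc/meminfo. The following is a minimal sketch, assuming a Linux kernel that reports MemAvailable (the field is treated as optional here). The function names are illustrative and not part of any ES tooling.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sizes)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def summarize(values):
    """Return free, cached and available memory in GiB."""
    kb_per_gib = 1024 * 1024
    available = values.get("MemAvailable")  # may be absent on older kernels
    return {
        "free_gib": values["MemFree"] / kb_per_gib,
        "cached_gib": values["Cached"] / kb_per_gib,
        "available_gib": None if available is None else available / kb_per_gib,
    }


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(summarize(parse_meminfo(f.read())))
    except OSError:
        print("/proc/meminfo is not readable on this system")
```

If free_gib is near 0 while available_gib is still large, the memory is sitting in page cache that the kernel can hand back on demand.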

I do understand that applications can still run, since this is shared memory and is available to any application that demands it. BUT my client starts lagging because no memory is available, cannot keep the same pace, and throughput drops.

I have gone through numerous posts on this. None helped.
How do I restrict ES/Lucene to only x% of the total RAM on the server? For example, in the above scenario:

2 ES Data Nodes - 64 GB.
1 client application - 16 GB
An amount equal to the ES heap left to Lucene for caching - 64GB

I should still be left with roughly 140GB of memory if Lucene/ES could be limited to only 64GB of OS RAM.
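
The budget works out as follows. This is only the arithmetic from the figures in this post, treating each heap as a round 32GB; the variable names are illustrative.

```python
# Figures from the post, in GB per server.
total_ram_gb = 286
es_heap_gb = 2 * 32        # two data nodes, "just under" 32GB each, rounded up
client_gb = 16             # bulk-loading client application
lucene_cache_gb = 64       # desired cap on page cache used by Lucene

committed_gb = es_heap_gb + client_gb + lucene_cache_gb
remaining_gb = total_ram_gb - committed_gb

print(f"committed: {committed_gb} GB, remaining: {remaining_gb} GB")
```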

I have not found any variable or parameter that can be set to restrict this. Some posts suggest it might be due to netty or to some bug, but I cannot find any variable or parameter that can be tuned for this.

The other applications (1 or 2) that run on the same servers as ES start experiencing heavy lag, which soon leads to decreased throughput. Due to a limitation, I need to have at least 1 or 2 instances of my application running on these servers.

If any further information is needed, please do let me know.
Please help.

As you point out, this is shared/cached memory, and it is managed at the OS level, outside Elasticsearch's control. You'll have to look up memory management for your OS.

Hi Mark,
This is not an OS-level issue. Every time the bulk load runs for 30-40 mins, the entire cache is taken up by the ES process. This makes all the other applications running on the same server respond slowly.
I tried changing the file store to niofs as well, but to no avail. The RAM used by ES still reaches 100%.
ES alone takes up 100% of RAM in cache/virtual memory.
There has to be some way to restrict ES to some % of shared/virtual memory.
ES version is 5.6

As I mentioned, there is: it's via the OS.

Hi Mark. Thanks for the reply.
What I have been trying to say is this: when I give ES 32GB of heap and leave the rest for Lucene to do its magic while my other applications run on the same server, Lucene+ES takes up the entire virtual memory (in top on RHEL, the processes are owned by the es user under which ES was started). This completely starves my other applications running on the same physical hardware.
I have gone through numerous posts, to no avail.

If you can assist me in identifying parameters in OS which can be changed to restrict this, it would be really helpful.
Huge pages are disabled.
I tried limiting virtual memory in limits.conf however ES did not even start due to bootstrap checks for unlimited virtual memory.
I tried adding NoPreferDirect to the Java opts, but even that did not help.
Disabling ExplicitGC did not work either.
I tried changing netty to netty4, but ES threw an exception saying it has to be security3 or security4, and neither of them worked in the end.
Changed the file store to "niofs". Did not work.
Made other changes in the OS, to no avail.
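
Regarding the limits.conf attempt: a small sketch, assuming a Unix system and Python's standard resource module, that shows which address-space (virtual memory) limit a process started from the current shell would inherit. It only reads the limit, and it is meaningful only when run as the same user that starts ES. The helper name is illustrative.

```python
import resource


def is_unlimited(limit):
    """True if an rlimit value means 'no limit'."""
    return limit == resource.RLIM_INFINITY


if __name__ == "__main__":
    # RLIMIT_AS is the maximum size of the process's virtual address space.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("soft limit unlimited:", is_unlimited(soft))
    print("hard limit unlimited:", is_unlimited(hard))
```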

Please guide me on the OS-level settings, if this can be controlled via the OS.
We are completely lost in finding a solution for this.
I am running ES 5.6 on RHEL 7.5
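
As a starting point for the OS side, here is a minimal sketch that only reads a few kernel virtual-memory settings from /proc/sys/vm so they can be compared across servers. It changes nothing. Which of these settings, if any, affects the behaviour described here is an assumption that would need testing, and the list of keys is my own selection.

```python
import os

# Standard Linux vm.* sysctls related to swapping, mmap and page cache writeback.
VM_KEYS = [
    "swappiness",
    "max_map_count",
    "dirty_ratio",
    "dirty_background_ratio",
    "vfs_cache_pressure",
    "min_free_kbytes",
]


def read_vm_settings(base="/proc/sys/vm", keys=VM_KEYS):
    """Return {key: value-as-string}, or None for keys that cannot be read."""
    settings = {}
    for key in keys:
        try:
            with open(os.path.join(base, key)) as f:
                settings[key] = f.read().strip()
        except OSError:
            settings[key] = None
    return settings


if __name__ == "__main__":
    for key, value in read_vm_settings().items():
        print(f"vm.{key} = {value}")
```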
