We are running Elasticsearch in a production environment on 6 servers, each of which has 80 vCPUs and 286GB of RAM, with 2 ES 5.6 data nodes running on each server.
Each data node is given just under 32GB of heap to keep compressed oops enabled.
Spinning speed is set to HDD.
Async flush is set.
Translog size is set to 1GB.
Refresh interval is set to 30s.
Memory locking (mlock) is enabled.
Swappiness is off.
The file descriptor limit is above 200,000.
Mmap count is 2612445.
There is one client application doing bulk insertion into ES. This client application uses 16GB of RAM and resides on the same server.
The system is under bulk indexing load, with 6 clients doing bulk insertion into ES; each instance uses a batch size of 7500 with 2 concurrent actions. The system is rarely used for searching (around 5-10 aggregation queries once an hour, at the hour rollover).
For the first 30-40 minutes under load, ES works properly (i.e. the nodes take in huge loads and work smoothly).
The top command and free -g show that the system is healthy, and my client is also able to push its load.
After 40-45 minutes, free -g on each server shows 0 under free and the entire available memory under shared/cache.
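To make the free versus cache distinction concrete, here is a minimal Python sketch that reads the same counters from /proc/meminfo. The function names are my own, and summing Buffers and Cached only approximates the cache column that free prints, so treat it as a rough cross-check rather than an exact reproduction of free -g.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def summarize(meminfo):
    """Return (free, cache, available) in whole GiB, rounded down."""
    kib_per_gib = 1024 * 1024
    free = meminfo.get("MemFree", 0) // kib_per_gib
    # Approximation: page cache plus buffers, ignoring reclaimable slab.
    cache = (meminfo.get("Cached", 0) + meminfo.get("Buffers", 0)) // kib_per_gib
    available = meminfo.get("MemAvailable", 0) // kib_per_gib
    return free, cache, available


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            free, cache, available = summarize(parse_meminfo(fh.read()))
        print(f"free={free}G cache={cache}G available={available}G")
    except OSError:
        print("/proc/meminfo could not be read (not a Linux host?)")
```

If free is 0 while available stays high, the memory is held as reclaimable cache rather than being truly exhausted, which is the state described above.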
I do understand that applications can still run, since this is shared/cached memory that is available to any application that demands it, BUT my client starts lagging because no memory is free and fails to keep the same pace, which decreases throughput.
I have gone through numerous posts on this. None helped.
How do I restrict ES/Lucene to use only x% of the total RAM available on my OS? For example, in the above scenario:
2 ES Data Nodes - 64 GB.
1 client application - 16 GB
Leaving equal amount of ES RAM to Lucene for caching - 64GB
By limiting Lucene/ES to only 64GB of my OS RAM, I should still be left with roughly 140GB of memory.
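The arithmetic behind that budget, as a small Python sketch. The constant and function names are mine, and the Lucene cache figure is the cap I would like to enforce, not an existing ES setting.

```python
# Per-server memory budget in GB, using the figures quoted above.
TOTAL_RAM_GB = 286
ES_HEAP_GB = 2 * 32            # two data nodes, roughly 32GB of heap each
CLIENT_GB = 16                 # bulk-indexing client on the same server
LUCENE_CACHE_GB = ES_HEAP_GB   # desired cap on OS cache for Lucene (a goal, not a real setting)


def remaining_gb(total, *reservations):
    """Return what is left of total after subtracting every reservation."""
    return total - sum(reservations)


left = remaining_gb(TOTAL_RAM_GB, ES_HEAP_GB, CLIENT_GB, LUCENE_CACHE_GB)
print(f"Expected headroom: {left} GB")  # 286 - 64 - 16 - 64 = 142
```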
I have not found any variable or parameter that can be set to restrict this. Some posts suggest it might be due to Netty or to a bug, but I am not able to find any variable or parameter that can be tuned for this.
The other applications (1 or 2) that run on the same servers as ES start experiencing heavy lag, which soon leads to decreased throughput. Due to a limitation, I need to have at least 1 or 2 instances of my application running on these servers.
If any further information is needed, please do let me know.