Our environment: a 3-node ES cluster, all three being data nodes. We have upgraded ES from 2.3.3 to 5.2.2.
Each data node is allocated a 31 GB heap (as recommended by the ES community). Node 1 mostly serves search requests, while Node 2 and Node 3 handle bulk insertions.
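For context, this is how the heap is pinned on each node. In 5.x this lives in `config/jvm.options` (the values shown are ours; everything else in that file is left at the defaults):

```
# config/jvm.options (ES 5.x) -- heap pinned to 31g on each data node
-Xms31g
-Xmx31g
```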
We have seen a constant surge in the memory usage of ES Node 1 (the node mostly used for search queries): it starts at ~32 GB resident memory, then the resident memory climbs to 40 GB... 45 GB... 50 GB... 56 GB... 60 GB... 62 GB, and the process is KILLED (by the kernel's OOM killer). This happens over a span of 24-30 hours. The only thing we can do at that point is restart ES, and the same cycle repeats.
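For what it's worth, we tracked the growth with a simple loop like the one below. It uses the current shell's PID (`$$`) as a stand-in; for ES you would substitute the JVM's PID, e.g. from `pgrep -f org.elasticsearch.bootstrap.Elasticsearch` (the process name is an assumption about how your node was launched):

```shell
# Print the resident set size (RSS, in KB) of a process.
# Substitute the Elasticsearch JVM's PID for "$$" to watch the node.
rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}

rss_kb "$$"
```

Running this every few minutes (e.g. under `watch` or cron) is how we saw the steady 32 GB → 62 GB climb rather than a sudden spike.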
I have already gone through "Out of memory (invoked oom-killer)", but it doesn't apply here, as we are running on a physical server with CentOS 6.8 (kernel 2.6.32-642.6.2.el6), 64 GB RAM, and 24 cores.
From this article, https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html , our guess is that the memory above the 32 GB heap is related to Lucene's caching (off-heap, mmapped segment files). Can someone please shed more light on what's happening here?
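One way to separate heap from off-heap growth is to compare `jvm.mem.heap_used_in_bytes` with `process.mem.total_virtual_in_bytes` from `GET _nodes/stats/jvm,process`. A minimal sketch of that comparison, run here against a made-up stats excerpt rather than our live cluster (the node name and byte values are illustrative, not measured):

```python
import json

# Hypothetical excerpt of a GET _nodes/stats/jvm,process response;
# the byte values are illustrative only (31 GiB heap, 62 GiB virtual).
stats = json.loads("""
{
  "nodes": {
    "node1": {
      "jvm":     {"mem": {"heap_used_in_bytes": 33285996544}},
      "process": {"mem": {"total_virtual_in_bytes": 66571993088}}
    }
  }
}
""")

for name, node in stats["nodes"].items():
    heap = node["jvm"]["mem"]["heap_used_in_bytes"]
    virt = node["process"]["mem"]["total_virtual_in_bytes"]
    # Everything above the heap is off-heap: mmapped Lucene segment
    # files, network buffers, thread stacks, JVM metaspace, etc.
    off_heap = virt - heap
    print(f"{name}: heap={heap / 2**30:.1f} GiB, "
          f"off-heap (virtual)={off_heap / 2**30:.1f} GiB")
```

If the heap figure stays flat at ~31 GB while RSS climbs, the growth is off-heap, which points away from a plain JVM heap leak.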
I have gone through GitHub issues, which say a memory-leak issue was already fixed before ES 5.2.2.
It'd be great if someone could help me understand this behaviour and suggest a possible solution.
PS: On the contrary, this was not happening with ES 2.3.3.