ES Heap to 100% and cluster halt

Hi guys
Hope one of you can help...
In our prod environment, we have a cluster of 5 data nodes (data:true,
master:false) + 3 masters (master:true, data:false). Elasticsearch 1.4.4,
Oracle Java 1.8.0_40.
Data nodes have 30GB memory, masters 15GB.
We have a problem where heap usage reaches the heap limit on some nodes, and
the whole cluster comes to a stop. This happens on maybe one or two nodes,
while the other ones are still ok.
No out of memory errors are displayed, but on the nodes that are still
alive, you can see some errors like "No search context for id [xxxxx]". I
need to restart the whole cluster for it to become responsive again.
Looking at heap usage, I see that it behaves properly for a while, showing a
nice sawtooth pattern, but after some time (~1 day) the heap on some node
starts going up and up without ever dropping, until it hits the limit.
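
To make the difference between the healthy saw pattern and the runaway climb concrete, here is a minimal sketch of how it could be detected from periodic heap samples. It assumes the samples are heap usage percentages computed from the heap_used_in_bytes and heap_max_in_bytes fields under jvm.mem in a node stats response. The function names, the window size, and the thresholds are all made up for illustration and are not tuned values.

```python
def heap_percent(node_stats):
    """Heap usage in percent for one node entry of a node stats response.

    Assumes the entry has jvm.mem.heap_used_in_bytes and
    jvm.mem.heap_max_in_bytes.
    """
    mem = node_stats["jvm"]["mem"]
    return 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]


def looks_stuck(samples, window=6, min_drop=5.0, high_water=75.0):
    """True if the last `window` samples show no real drop and end high.

    A healthy saw pattern has a sharp drop after each collection, so at
    least one step in the window falls by `min_drop` points or more.
    The thresholds here are illustrative assumptions.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough data yet
    # Positive values are drops between consecutive samples.
    drops = [a - b for a, b in zip(recent, recent[1:])]
    return max(drops) < min_drop and recent[-1] >= high_water


healthy = [40, 55, 70, 45, 60, 74, 48, 63]  # rises, then drops back
stuck = [45, 60, 74, 80, 86, 91, 95, 98]    # only ever rises

print(looks_stuck(healthy))  # False
print(looks_stuck(stuck))    # True
```

Run per node against samples taken every few minutes, something like this could flag the affected node while the other nodes are still ok.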

You can see some of this in this graph of one of our crashes:
Also, I notice that CPU usage peaks when that rise starts.

In elasticsearch.yml I don't have many important settings other than
bootstrap.mlockall: true.

In the environment variables file I have:


Memory usage on the nodes seems to be fine, with around 6GB free all the
time (even during the crashes).

Field data seems to be around 300MB all the time, while the filter cache is
1.5GB (10% of the heap, the default).
(In that graph you can see the filter cache size on 2 nodes going up at the
end; that's when I increased it to 25% on those 2 nodes, but the effect was
the same: the cluster crashes the same way).
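
As a sanity check on those numbers, here is a small sketch of the arithmetic. It assumes the 1.5GB filter cache is exactly the 10% default share of the heap; the helper names are made up for illustration.

```python
def implied_heap_gb(cache_gb, cache_percent):
    """Heap size implied by a cache that is `cache_percent` of the heap."""
    return cache_gb * 100 / cache_percent


def cache_size_gb(heap_gb, cache_percent):
    """Cache size for a given heap size and percentage."""
    return heap_gb * cache_percent / 100


heap_gb = implied_heap_gb(1.5, 10)    # 1.5GB cache at the 10% default
raised_gb = cache_size_gb(heap_gb, 25)  # same heap, cache raised to 25%

print(heap_gb)    # 15.0
print(raised_gb)  # 3.75
```

Under that assumption the heap on the data nodes is about 15GB (half of the 30GB of memory), and raising the filter cache to 25% allows it to grow to about 3.75GB on those 2 nodes.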

I wonder if this is something related to, but seems to be fixed
by 1.4.4.

Any help will be greatly appreciated.

Kind regards

