I am running a 16-node cluster, broken down as follows:
3 master nodes
3 search nodes
10 data nodes
Elasticsearch version 2.0.0
Over the last few weeks, we have been noticing something odd. Whichever node has been elected master does not garbage collect when its heap hits 75%, as we would expect. Instead, the heap keeps rising until it hits 100%, and then the node times out, causing cluster instability. Restarting the node at any point fixes this (including restarting it preemptively), but from my understanding it should be GCing on its own well before that point. The other master-eligible nodes are doing very little (their heap usage is around 2-4%), and the data nodes are GCing perfectly fine.
A graph of our heap percentages over time:
Nothing useful is in the logs, so I am at a loss as to what is causing this problem.
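
In case it helps anyone compare numbers, here is a minimal sketch of how heap usage and old-generation GC counters can be pulled per node from the node stats API (GET /_nodes/stats/jvm). The base URL and the function names are my own assumptions for illustration, and the field names follow the node stats response as I understand it, so adjust them to your setup.

```python
import json
from urllib.request import urlopen


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for every node. base_url is an assumed address."""
    with urlopen(base_url + "/_nodes/stats/jvm") as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_jvm(stats):
    """Return one row per node: name, heap percentage, old-gen GC counters."""
    rows = []
    for node in stats.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        old = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        rows.append({
            "name": node.get("name"),
            "heap_used_percent": jvm.get("mem", {}).get("heap_used_percent"),
            "old_gc_count": old.get("collection_count"),
            "old_gc_millis": old.get("collection_time_in_millis"),
        })
    # Highest heap usage first, so the troubled node is at the top.
    return sorted(rows, key=lambda r: r["heap_used_percent"] or 0, reverse=True)
```

Calling summarize_jvm(fetch_jvm_stats()) periodically and watching whether old_gc_count on the elected master ever increases while heap_used_percent climbs should show whether old-generation collections are running at all or running without freeing memory.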