We have been having problems with our master node (which is also a data node) running out of memory. I've been gathering metrics and correlating heap used percent against them. In the 5-, 2-, and 1-day windows leading up to the master node failures, the refresh rate (refreshes per second), query cache evictions, and query cache size in bytes all increase roughly linearly with heap used percent and are strongly correlated with it. For example, as heap usage climbed towards 100%, the refresh rate rose from 12 to 14 refreshes per second. I've been trying to draw conclusions from this analysis, but I'm not sure I can make any actionable changes yet. Does any of this seem plausible?
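To make the analysis concrete, here is a minimal sketch of a correlation check of this kind. It assumes a plain Pearson coefficient, and the sample values are made up for illustration (they are not the actual collected metrics); `pearson`, `heap_used_percent`, and `refreshes_per_sec` are hypothetical names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples, made up for illustration only.
heap_used_percent = [60, 68, 75, 83, 90, 97]
refreshes_per_sec = [12.0, 12.4, 12.8, 13.2, 13.6, 14.0]

r = pearson(heap_used_percent, refreshes_per_sec)
print(f"heap used % vs refresh rate: r = {r:.3f}")
```

A coefficient close to 1 only says the two series rise together over the window; it does not say which one (if either) is driving the other.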
As for the query cache size in bytes, I've noticed that the master tends to go down around the time the query cache reaches roughly 140-160MB (I'm not suggesting the two are related, it's only an observation). By the way, is there a setting that controls query cache size? 140-160MB is small compared to the total allocated heap of 24GB, so I feel like I'm missing another part of the story here. Any thoughts on this would be appreciated as well!
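To put the 140-160MB figure in perspective against the 24GB heap, here is a quick back-of-the-envelope calculation. It assumes 1GB = 1024MB; `percent_of_heap` is a hypothetical helper name.

```python
HEAP_GB = 24
HEAP_MB = HEAP_GB * 1024  # assumption: 1 GB = 1024 MB


def percent_of_heap(cache_mb, heap_mb=HEAP_MB):
    """Return the cache size as a percentage of the total heap."""
    return 100.0 * cache_mb / heap_mb


# The observed query cache range at the time the master goes down.
for cache_mb in (140, 160):
    share = percent_of_heap(cache_mb)
    print(f"{cache_mb} MB is {share:.2f}% of a {HEAP_GB} GB heap")
```

Both ends of the range come out well under 1% of the heap, which is why the query cache on its own does not look like it accounts for the memory pressure.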