We are currently running version 2.4.0, with 20 nodes in the cluster holding ~49,000 shards and ~17TB of data. I'm consistently seeing an issue where the master node gradually runs out of heap memory (24GB) over a roughly 10-day period. During each cycle the heap usage climbs until the node can't handle any more. From the logs, I could see that as the master got very close to 100% there was a sudden increase in unassigned shards; I'm pretty sure those were the shards on the master. In this case the master switched over and the previous master didn't crash, but I could also see very long GC collections occurring right before the node hit 100%. We have been hitting this issue for months and I'm a little lost in terms of finding the root cause.
More recently I've been trying to monitor various statistics available via the nodes stats API. Unfortunately I'm not seeing much other than the heap percentage gradually skewing towards 100% (see attached image). I was hoping I could get some help with breaking down the issue.
Is there anything in the nodes statistics to help me understand what is gradually filling up the heap? I was looking at segment memory for the master and it looks pretty constant at ~4GB throughout the 10-day cycle. About a day before the master gave out I could see a spike in the query cache, but it only peaked at 163MB, which is pretty small compared to the overall 24GB heap. The number of open file handles looks constant as well.
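One way to frame this is to add up the heap-resident components that the nodes stats do report (segment memory, query cache, fielddata, request cache) and compare the sum against total heap used; the remainder is heap that those statistics can't explain. The sketch below uses hypothetical numbers shaped like a 2.x `_nodes/stats/jvm,indices` response:

```python
# Sum the heap consumers that _nodes/stats accounts for, and compare
# against total heap used. The remainder is the "unexplained" portion.
# Numbers below are hypothetical, shaped like the 2.x nodes stats
# response (GET _nodes/<master>/stats/jvm,indices).
GB, MB = 1024 ** 3, 1024 ** 2

stats = {
    "jvm": {"mem": {"heap_used_in_bytes": 20 * GB}},
    "indices": {
        "segments": {"memory_in_bytes": 4 * GB},
        "query_cache": {"memory_size_in_bytes": 163 * MB},
        "fielddata": {"memory_size_in_bytes": 512 * MB},
        "request_cache": {"memory_size_in_bytes": 64 * MB},
    },
}

idx = stats["indices"]
accounted = (
    idx["segments"]["memory_in_bytes"]
    + idx["query_cache"]["memory_size_in_bytes"]
    + idx["fielddata"]["memory_size_in_bytes"]
    + idx["request_cache"]["memory_size_in_bytes"]
)
unaccounted = stats["jvm"]["mem"]["heap_used_in_bytes"] - accounted
print(f"accounted: {accounted / GB:.2f} GB, "
      f"unaccounted: {unaccounted / GB:.2f} GB")
```

If the unaccounted portion is the part that grows over the 10 days, that would point away from the caches and towards things the stats don't itemize, such as the cluster state held on the master.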
Here are other observations and possible theories we have that I would like to test.
I understand this cluster is way over-sharded at this point, so as a test we could take, say, half the shards in the cluster offline and run that way for a while to see whether we still experience this gradual increase in heap usage on the master.
As I understand it, the master creates indices and assigns their shards to nodes. Is it possible the master just can't keep up with the index-creation load? Is there a way to measure index creation on the master? I don't see any bulk, index, or merge thread pool rejections.
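One indirect measure of whether the master is keeping up is its pending cluster-state task queue: `GET _cluster/pending_tasks` lists queued tasks such as index creation and shard assignment, and if the master falls behind, the queue depth and `time_in_queue_millis` should grow. A small sketch of summarizing that response (the task entries here are hypothetical):

```python
# Summarize the master's pending cluster-state tasks by type and age.
# Real data would come from GET _cluster/pending_tasks; the entries
# below are hypothetical examples of its documented shape.
from collections import Counter

response = {
    "tasks": [
        {"priority": "URGENT", "source": "create-index [logs-2016.09.01]",
         "time_in_queue_millis": 842},
        {"priority": "HIGH", "source": "shard-started [logs-2016.09.01][3]",
         "time_in_queue_millis": 310},
        {"priority": "HIGH", "source": "shard-started [logs-2016.09.01][7]",
         "time_in_queue_millis": 295},
    ]
}

by_type = Counter(t["source"].split(" ")[0] for t in response["tasks"])
oldest_ms = max(t["time_in_queue_millis"] for t in response["tasks"])
print(f"{len(response['tasks'])} pending tasks, oldest {oldest_ms} ms")
print(dict(by_type))
```

Polling this periodically (e.g. once a minute) and graphing the count and the oldest task age alongside the master's heap usage might show whether the two climb together.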
Any thoughts on where to take this investigation to gain more insight? I'd like to try decreasing the number of shards, but it would be nice to have a measurement of why the heap is slowly increasing before I do.