I am running a cluster with 20 data nodes (9TB each, 32GB heap), 3 client nodes (24GB heap), and 3 master nodes (20GB heap). The cluster has been stable under monitoring for the past 90 days. Yesterday, nowhere near a UTC rollover (which would create new daily indices), we started seeing increased GC time on the masters. The GCs are more frequent and take longer. Here's a graph:
The two sawtooths are from my manually restarting the master. The newly elected masters show the same memory growth.
Here is the time spent in GC:
What can I do to debug this issue? In the past, the cause was out-of-control schema growth from poorly processed data, but here I'm not sure. How can I track down how memory is being used on the master?
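For anyone who wants to compare numbers, here is a minimal sketch of how per-node GC figures like the ones graphed above could be derived from two samples of the node stats API (`GET _nodes/stats/jvm`). The function name `gc_delta` and the sample values are hypothetical and made up for illustration; only the field names (`heap_used_percent`, `collection_count`, `collection_time_in_millis`) come from the node stats response.

```python
def gc_delta(before, after):
    """Compare two parsed `_nodes/stats/jvm` responses (dicts).

    Returns, per node name, the current heap usage and the number of
    collections / milliseconds spent in GC between the two samples.
    Counters reset when a node restarts, so a negative delta means the
    node was restarted between samples.
    """
    report = {}
    for node_id, node in after["nodes"].items():
        prev = before["nodes"].get(node_id)
        if prev is None:
            continue  # node was not present in the first sample
        prev_collectors = prev["jvm"]["gc"]["collectors"]
        collectors = {}
        for name, stats in node["jvm"]["gc"]["collectors"].items():
            old = prev_collectors.get(name, {})
            collectors[name] = {
                "collections": stats["collection_count"]
                - old.get("collection_count", 0),
                "millis": stats["collection_time_in_millis"]
                - old.get("collection_time_in_millis", 0),
            }
        report[node["name"]] = {
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "gc": collectors,
        }
    return report


# Made-up sample data shaped like the node stats response (illustrative only).
sample_before = {"nodes": {"abc123": {"name": "master-1", "jvm": {
    "mem": {"heap_used_percent": 61},
    "gc": {"collectors": {"old": {"collection_count": 10,
                                  "collection_time_in_millis": 4000}}}}}}}
sample_after = {"nodes": {"abc123": {"name": "master-1", "jvm": {
    "mem": {"heap_used_percent": 82},
    "gc": {"collectors": {"old": {"collection_count": 14,
                                  "collection_time_in_millis": 9000}}}}}}}

print(gc_delta(sample_before, sample_after))
```

Sampling the same way on a data node and on the elected master would show whether the growth is specific to the master.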