We are running Elasticsearch 5.6.3 on 3 nodes with 64 GB of memory (two physical machines and one VM on CentOS), with .NET Core services that query and write to the index.
Our platform is a classifieds site with many search facets, and we use Datadog to monitor various health signals. The most pressing concern at the moment is that JVM heap use across all the nodes stays very stable for about 4 hours, after which garbage collection becomes erratic and more frequent. This leads to slower query times and a less stable cluster.
As a workaround, we recycle the IIS service every 4 hours, which restores normal garbage collection patterns for the next 4 hours. There are no other backend services querying the cluster.
The question is: what is the best approach, and which are the most obvious metrics to measure, to understand why JVM heap use is normal for hours and then slowly starts to degrade?
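
For context, here is a minimal sketch of the kind of sampling we could run alongside Datadog, assuming the node stats API (`GET /_nodes/stats/jvm,indices`) is a reasonable place to start. The base URL, the function names, and the choice of fields are our own assumptions for illustration, not a confirmed list of the right metrics.

```python
import json
import urllib.request


def fetch_node_stats(base_url):
    """Fetch JVM and index-level stats for every node.

    base_url is an assumption, e.g. "http://localhost:9200".
    """
    with urllib.request.urlopen(base_url + "/_nodes/stats/jvm,indices") as resp:
        return json.load(resp)


def summarize(stats):
    """Reduce a _nodes/stats response to a few heap and GC fields per node name."""
    summary = {}
    for node in stats["nodes"].values():
        jvm = node["jvm"]
        indices = node.get("indices", {})
        old_gc = jvm["gc"]["collectors"]["old"]
        summary[node["name"]] = {
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "old_pool_used_bytes": jvm["mem"]["pools"]["old"]["used_in_bytes"],
            # GC counters are cumulative since node start, so they need diffing.
            "old_gc_count": old_gc["collection_count"],
            "old_gc_time_ms": old_gc["collection_time_in_millis"],
            # Heap consumers that can grow with faceted (aggregation) queries.
            "fielddata_bytes": indices.get("fielddata", {}).get("memory_size_in_bytes", 0),
            "query_cache_bytes": indices.get("query_cache", {}).get("memory_size_in_bytes", 0),
            "segments_bytes": indices.get("segments", {}).get("memory_in_bytes", 0),
        }
    return summary


def gc_delta(previous, current):
    """Old-generation GC count and time accumulated between two summaries."""
    return {
        name: {
            "old_gc_count": cur["old_gc_count"] - previous[name]["old_gc_count"],
            "old_gc_time_ms": cur["old_gc_time_ms"] - previous[name]["old_gc_time_ms"],
        }
        for name, cur in current.items()
        if name in previous
    }
```

The idea would be to call `summarize(fetch_node_stats(...))` at a fixed interval, log the result, and compare `gc_delta` between samples taken before and after the 4-hour mark to see which heap consumer grows alongside the old-generation collections.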