This is a typical heap usage profile from nodes on the current cluster. There is under 1 TB of data: 75 million docs in roughly 600 shards across 80 indices. The boxes have now been upgraded from 32 GB to 64 GB of memory, with just under 30 GB dedicated to the heap, and have 8 cores. By comparison, our 2-box dev cluster has a typical sawtooth heap profile.
We took the number of concurrent queries down from about 25 to 14. None of these queries use aggregations, although ad hoc historic queries with aggregations can still happen.
Before the reduction (at about 25/sec), time spent in GC was about 15% of total time on a typical node. On dev it's a tenth of that. After lowering the number of concurrent queries, GC time on the main cluster is now around 8%. But the GC pattern still looks like the one above: massive spikes occur frequently, each followed immediately by GC that runs for a couple of seconds at a time.
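For anyone wanting to compare numbers, here is a minimal sketch of how a GC percentage like the ones above could be derived from two samples of the node stats JVM section (the `jvm` block returned by `GET _nodes/stats/jvm`). This is not necessarily how the figures above were measured. The function name and the sample values are made up for illustration; the field names `uptime_in_millis` and `collection_time_in_millis` are the ones node stats reports.

```python
def gc_time_percent_between(before: dict, after: dict) -> float:
    """Percent of wall-clock time spent in GC between two node stats samples.

    `before` and `after` are the per-node dicts from the node stats response
    (hypothetical helper; assumes both samples come from the same JVM run).
    """
    def total_gc_ms(node: dict) -> int:
        # Sum collection time over all collectors (e.g. young and old).
        collectors = node["jvm"]["gc"]["collectors"]
        return sum(c["collection_time_in_millis"] for c in collectors.values())

    elapsed_ms = after["jvm"]["uptime_in_millis"] - before["jvm"]["uptime_in_millis"]
    if elapsed_ms <= 0:
        raise ValueError("samples must be taken in order from the same JVM run")
    gc_ms = total_gc_ms(after) - total_gc_ms(before)
    return 100.0 * gc_ms / elapsed_ms


# Made-up sample values, 60 seconds apart, purely for illustration.
sample_before = {"jvm": {"uptime_in_millis": 1_000_000, "gc": {"collectors": {
    "young": {"collection_time_in_millis": 50_000},
    "old": {"collection_time_in_millis": 10_000}}}}}
sample_after = {"jvm": {"uptime_in_millis": 1_060_000, "gc": {"collectors": {
    "young": {"collection_time_in_millis": 54_000},
    "old": {"collection_time_in_millis": 10_800}}}}}

print(f"{gc_time_percent_between(sample_before, sample_after):.1f}% of time in GC")
```

Diffing two samples rather than dividing the cumulative totals gives the rate for the current load, which matters here because the query rate changed partway through the node's uptime.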
I've looked over the model and cannot see anything that could be causing this memory pressure. Has anyone experienced this before?