@tomhe, thanks for sending over the GC logs. They confirm that heap usage went above 95%, apparently for an extended period.
I will need to spend some time analysing and absorbing the output. However, there is more GC output that could be relevant here, so if possible it would be good to enable the following GC logging on the nodes:
9-:-Xlog:gc*,gc+age=trace,gc+ihop=trace,gc+heap=trace,gc+humongous=trace,gc+phases=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
This will produce somewhat more output, though judging from my local experiments the amount looks reasonable. It may be advisable to try it on one node first before rolling it out on all nodes (if you are in a position to enable this at all).