GC not working

I've been having a lot of problems with the GC on my cluster. It always reaches a point where the heap fills up, the GC can't reclaim any memory, and my nodes become unresponsive. I've tried many different things to no avail. I'm currently running 20 shards, 2 replicas, and 16GB of RAM (8GB for heap) per node, with mlockall set to true and "sudo swapoff -a" run on all the machines (a sketch of these settings follows the log excerpt below). However, I still get unresponsiveness after a while because the GC isn't doing its job:

[2016-06-28 11:44:45,302][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50439][1134] duration [24.7s], collections [1]/[24.8s], total [24.7s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [42.5mb]->[38.1mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:45:19,942][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50440][1135] duration [34.5s], collections [1]/[34.6s], total [34.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [38.1mb]->[37mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:45:44,822][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50441][1136] duration [24.7s], collections [1]/[24.8s], total [24.7s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [37mb]->[36.3mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:46:18,494][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50442][1137] duration [33.5s], collections [1]/[33.6s], total [33.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [36.3mb]->[36.4mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:46:43,184][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50443][1138] duration [24.5s], collections [1]/[24.6s], total [24.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [36.4mb]->[35.3mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:47:16,802][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50444][1139] duration [33.5s], collections [1]/[33.6s], total [33.5s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [35.3mb]->[34.9mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
[2016-06-28 11:47:43,330][WARN ][monitor.jvm ] [ursa-es-data-node-18] [gc][old][50445][1140] duration [26.4s], collections [1]/[26.5s], total [26.4s]/[5.9h], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [399.4mb]->[399.4mb]/[399.4mb]}{[survivor] [34.9mb]->[32.3mb]/[49.8mb]}{[old] [7.5gb]->[7.5gb]/[7.5gb]}
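
For reference, the settings described above look roughly like this (ES 2.x-style; the file locations and exact values here are a sketch, not copied verbatim from my nodes):

    # /etc/default/elasticsearch (or equivalent environment file): give half the RAM to the heap
    ES_HEAP_SIZE=8g

    # elasticsearch.yml: lock the heap into memory
    bootstrap.mlockall: true

    # run once on every host so the heap can't be swapped out
    sudo swapoff -a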

Does anyone have an idea of what I can do to try to debug this problem?

Please send me the output of http://localhost:9200/_nodes/stats?pretty&human
If my guess is correct, most of the heap is occupied by the field data cache.
If that is the case, there are ways to reduce the field data cache.
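
For example, you can check how much heap field data is using and put a cap on the cache (a sketch; the 30% value is only illustrative, not a recommendation for your cluster):

    # per-node field data usage
    curl 'http://localhost:9200/_nodes/stats/indices/fielddata?pretty&human'

    # elasticsearch.yml: bound the field data cache so entries get evicted instead of filling the heap
    indices.fielddata.cache.size: 30%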

You need to dump the node stats API output, and possibly the heap, and find out what's sitting on the heap that cannot be freed (field data, suggesters, etc.). Then you'll have to address it accordingly.
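
Roughly along these lines (the file names and <es_pid> are placeholders):

    # dump the node stats to a file
    curl 'http://localhost:9200/_nodes/stats?pretty&human' > nodes_stats.json

    # dump the Elasticsearch JVM heap for offline analysis (e.g. in a heap analyzer); <es_pid> is the Elasticsearch process id
    jmap -dump:format=b,file=es-heap.hprof <es_pid>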

I ran that and the output was really long. I made a gist on GitHub for it:
gist

Thanks for helping me look into it! I'm still not proficient enough to debug this type of problem myself.

ursa-es-data-node-18 is one of the problematic nodes at the moment.

It seems the output you sent is not from the problematic period.
Based on that output, the cluster was started around 50 minutes ago, and there are no frequent GCs happening on any of the machines.
Can you please send me the output from when your cluster is in an unresponsive state?
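
If it's hard to catch it live, something like this (just a sketch) can snapshot the stats every minute so you have output from the bad period:

    # take a node stats snapshot every 60 seconds; keep the files from when the node went unresponsive
    while true; do
      curl -s 'http://localhost:9200/_nodes/stats?pretty&human' > "nodes_stats-$(date +%s).json"
      sleep 60
    done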

Also, some other observations:
22 node cluster
only 1 master-eligible node.
1 client node
20 data nodes.
In my opinion, having only one master-eligible node in this cluster is a very bad choice. You need at least 3 master-eligible nodes to make the cluster highly available (a sketch of the relevant settings is below).
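
For example, on three of the nodes (ES 2.x-style settings; the values are a sketch to adjust to your topology):

    # elasticsearch.yml on the three master-eligible nodes
    node.master: true
    node.data: false

    # quorum of master-eligible nodes: (3 / 2) + 1 = 2, to avoid split brain
    discovery.zen.minimum_master_nodes: 2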