CPU utilization of the whole cluster suddenly spikes to 100%

Note that during this 10-15 minute period, the packetbeat data shows that only a few requests succeeded. So my guess is that a few seconds of old-generation GC leads to a request build-up and a CPU spike, and that it then takes 10-15 minutes for the system to become stable again?
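To check this guess, one option would be to take two snapshots of the nodes stats API (`GET /_nodes/stats/jvm`) around a spike and compare the old-generation GC counters per node. Below is a minimal sketch of that comparison. The helper names `fetch_jvm_stats` and `old_gc_delta` and the `http://localhost:9200` address are my own assumptions for illustration, not something from our setup.

```python
import json
import urllib.request


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch JVM node stats. The base_url default is an assumed address."""
    with urllib.request.urlopen(base_url + "/_nodes/stats/jvm") as resp:
        return json.load(resp)


def old_gc_delta(before, after):
    """Compare two nodes-stats snapshots.

    Returns {node_name: (extra_old_collections, extra_old_gc_millis)}
    for every node present in both snapshots.
    """
    deltas = {}
    for node_id, node in after["nodes"].items():
        prev = before["nodes"].get(node_id)
        if prev is None:
            continue  # node joined between the two snapshots
        old_now = node["jvm"]["gc"]["collectors"]["old"]
        old_prev = prev["jvm"]["gc"]["collectors"]["old"]
        deltas[node.get("name", node_id)] = (
            old_now["collection_count"] - old_prev["collection_count"],
            old_now["collection_time_in_millis"]
            - old_prev["collection_time_in_millis"],
        )
    return deltas
```

If a node shows several seconds of extra old GC time in the window right before the CPU spike, that would support the theory; if the counters barely move, the cause is probably elsewhere.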

Are there any improvements that we can make? We are looking to optimize the filter caches, as I said in the original post. Any other suggestions?

Thanks.