I have a 4-node Elasticsearch cluster (64 GB RAM, 8 CPUs). Data streams in at 800-3000 messages per second, and the cluster is mainly used for aggregations. It worked very well for a few months without problems, but today CPU usage hit 100% while I/O wait was high, and messages were lost.
I suspect the search queries are too demanding. Is there any way I can check the slow query logs?
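For context on what "slow query logs" would mean here: Elasticsearch has a per-index search slow log controlled by threshold settings such as `index.search.slowlog.threshold.query.warn`. The sketch below only builds the settings request so it can be inspected before anything is sent. The host `http://localhost:9200`, the index name `my-index`, the helper name `build_slowlog_request`, and the threshold values are all placeholders I made up for illustration, not values from my cluster.

```python
import json
import urllib.request


def build_slowlog_request(base_url, index, warn="10s", info="5s", fetch_warn="1s"):
    """Build (but do not send) a PUT request that sets search slow log thresholds.

    base_url, index and the threshold values are illustrative assumptions.
    """
    settings = {
        # Log the query phase at WARN / INFO when it takes longer than these thresholds.
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
        # Log the fetch phase at WARN when it takes longer than this threshold.
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_slowlog_request("http://localhost:9200", "my-index")
    # Inspect the request before sending it.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the settings: urllib.request.urlopen(req)
```

If something like this is applied, slow searches should then show up in the index search slow log file on each node, which is what I would want to look at for the peak period.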
New Relic screenshot:
Marvel screenshot:
Zoomed-in Marvel screenshot at the peak time: