Elasticsearch I/O wait


#1

I have a 4-node Elasticsearch cluster (64 GB RAM, 8 CPUs). Data streams in at 800-3000 messages per second, and the cluster is mainly used for aggregations. It worked very well for a few months without problems, but today CPU usage hit 100% while I/O wait was high, and messages were lost.

I suspect the search queries are too demanding. Is there any way I can check the slow query logs?

New Relic screenshot:

Marvel screenshot:

Zoomed-in Marvel screenshot at the peak time:
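
For reference on the slow query log question, here is a minimal sketch of how search slow log thresholds might be set on one index through the index settings API. The `index.search.slowlog.threshold.*` setting names come from Elasticsearch's search slow log feature, but the host, the index name `logs-example`, and the threshold values are placeholders I made up, and the exact behaviour should be checked against the documentation for the version in use. The script only builds the request and prints it; nothing is sent.

```python
import json
from urllib.request import Request


def slowlog_settings(query_warn="10s", query_info="5s", fetch_warn="1s"):
    """Return search slow log thresholds (values here are arbitrary examples)."""
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }


def build_slowlog_request(base_url, index, settings):
    """Build (but do not send) a PUT <base_url>/<index>/_settings request."""
    body = json.dumps(settings).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host and index name; replace with real ones.
    req = build_slowlog_request(
        "http://localhost:9200", "logs-example", slowlog_settings()
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the settings: urllib.request.urlopen(req)
```

Once thresholds like these are in place, searches slower than the limits should show up in the index search slow log file on each node, which can then be compared against the CPU and I/O wait peaks in the graphs above.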


(system) #2