Detecting anomalies in Metricbeat data


I am pretty new to ES ML and Metricbeat, so please excuse me if my question seems a bit unusual.
I set up a job to detect anomalies in the field "system.cpu.user.pct". I am able to detect anomalies successfully, but I am struggling to find a way to determine the reason for an anomaly once it occurs. Let's take an example:

In the following job I am detecting unusual increases (high_mean) in the metric system.cpu.user.pct. From the Metric Viewer we can see that on June 15th at 14:30 there is a sudden increase in the metric for ddlflsas102. My question is: how can I go about finding a potential reason for the anomaly? To be more clear, how can I pinpoint the cause of this sudden spike in the metric? What field from Metricbeat, or what other technique, could help me here?


Take ML out of the picture for a moment.

What do you think "causes" CPU spikes? If you had to investigate this manually and you had all of the possible information available, where would you look?

I'm not trying to be facetious - I am merely trying to demonstrate that the "cause" could be expressed somewhere completely different. For example, there might be a line in a log file somewhere that says "Begin bulk processing 1000000 records", or maybe a JVM on that machine just began garbage collection. Unless you are watching (or have ML jobs on) those other data sources, the "cause" might go unnoticed.

Also, be mindful that anomalies are statistical aberrations, not necessarily "alerts". For example, this may be flagged as a "critical" anomaly because the actual value is 8x higher than it normally is, but look at the actual value: it is still less than 1% CPU utilization. You must remember that:

a) the anomalies found by statistical analysis are data agnostic and completely relative to past history
b) if you know that anomalies with such a low absolute value are not meaningful to you, then apply Custom Rules
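As an illustration of point (b), a custom rule can tell the job to skip results whenever the actual value is below some absolute threshold. A minimal sketch using the anomaly detection job update API - the job id and the 0.05 (5% CPU) threshold here are hypothetical, so adjust them for your own job:

```json
POST _ml/anomaly_detectors/cpu_user_pct_job/_update
{
  "detectors": [
    {
      "detector_index": 0,
      "custom_rules": [
        {
          "actions": ["skip_result"],
          "conditions": [
            {
              "applies_to": "actual",
              "operator": "lt",
              "value": 0.05
            }
          ]
        }
      ]
    }
  ]
}
```

With this rule in place, spikes that are statistically unusual but still below 5% utilization will no longer produce anomaly records.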

Also, anomalies on a single metric, viewed in isolation, are not always meaningful. Even a spike of the CPU to 100% may have little to no substantive impact on the host or on the application running on that host. However, anomalies in CPU, log messages, end-user response time, etc. taken together may be much more informative. Consider broadening your ML jobs to related data sources and using the correlative power of the Anomaly Explorer to overlay the results of several ML jobs to look for clues to probable root causes.
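As a sketch of what a companion job on a related data source might look like, the following creates a simple event-rate job over log data, which you could then view alongside the CPU job in the Anomaly Explorer. The job id, bucket span, and time field shown are hypothetical placeholders:

```json
PUT _ml/anomaly_detectors/log_event_rate_job
{
  "description": "Unusual log event rate (companion to CPU job)",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "count", "detector_description": "event rate" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

A count spike in this job at the same bucket as the CPU anomaly (e.g. a burst of error logs at 14:30) would be a strong clue toward the root cause.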
