We have hundreds of remote servers on customer sites. Each server runs a speed test once an hour and feeds the results back to Elasticsearch via Filebeat. Each result is stored as a float in either a download or an upload field. This all shows up in Discover as we would expect.
I was hoping to use ML anomaly detection to keep track of these results for each server and report on, well, any anomalies.
I feel like this is entirely possible, but the question is what is the best approach? Can we have one job that will perform this for every server, perhaps based on hostname as a differentiator? Or would we require a separate job for each server? The former is obviously much more attractive, just not sure it's doable?
Any advice would be appreciated.
It is totally possible to split across a unique field like hostname. What you want is a "multi-metric" job where the partition_field_name is the hostname field (or whatever the keyword field is on which you want to partition).
This will create one anomaly detection job and support each individual hostname.
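For anyone finding this later, here is a rough sketch of what such a job body could look like, written as a Python dict that could be sent to the create anomaly detection job API. The `upload` field and the `host.hostname.keyword` partition field come from this thread; the `1h` bucket span (chosen to match the hourly speed tests), the `@timestamp` time field, and the helper name `build_speedtest_job` are assumptions for illustration, so adjust them to your own data.

```python
import json


def build_speedtest_job(metric_field: str, partition_field: str,
                        bucket_span: str = "1h") -> dict:
    """Build a job body with a single detector that is split per host.

    bucket_span and the @timestamp time field are assumptions, not values
    confirmed anywhere in this thread.
    """
    return {
        "description": f"mean({metric_field}) partitioned by {partition_field}",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {
                    "function": "mean",
                    "field_name": metric_field,
                    # One model is built per distinct value of this field.
                    "partition_field_name": partition_field,
                }
            ],
            # Listing the partition field as an influencer is optional.
            "influencers": [partition_field],
        },
        "data_description": {"time_field": "@timestamp"},
    }


job = build_speedtest_job("upload", "host.hostname.keyword")
print(json.dumps(job, indent=2))
```

A second detector for `download` could be added to the same `detectors` list if you want both directions covered by one job.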
Hope this helps
Thanks for the reply Ben. This is exactly what I have already been looking at, which is encouraging!
I have initially been testing with just two servers, trying to monitor the upload speeds. So I have the detector set to mean upload, and the partition field set to host.hostname.keyword. I am getting results in the graph, but one thing makes me think it's not quite working correctly: in the host.hostname.keyword drop down menu highlighted in the screenshot, I only see 1 hostname, but below it says "Single time series analysis of avg upload (2 distinct host.hostname.keyword values)". I would have expected to see the hostnames of both servers that are feeding their results in?
You are doing everything correctly and this is purely the behaviour of this view. (In fact, we've recently updated it to only display the chart once a partition field value has been selected, because we felt the behaviour which is confusing you is confusing!)
As it stands, if you don't select a host, the single metric viewer shows the average of all host mean(upload) values and all anomalies for any host. As soon as you choose a host (from the dropdown you highlighted) it will show just the mean(upload) for that host and just the anomalies associated with that host. The analysis always creates a separate model for each host and labels the anomalies with the host to which they relate.
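To make the difference between the two chart modes concrete, here is a small illustration in plain Python. It is not Kibana's code, just a sketch of the behaviour described above; the host names, bucket labels, and values are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-bucket mean(upload) values, one row per host per bucket.
bucket_means = [
    {"bucket": "10:00", "host": "server-a", "mean_upload": 40.0},
    {"bucket": "10:00", "host": "server-b", "mean_upload": 20.0},
    {"bucket": "11:00", "host": "server-a", "mean_upload": 42.0},
    {"bucket": "11:00", "host": "server-b", "mean_upload": 18.0},
]


def series_for_view(rows, selected_host=None):
    """Return the charted value per bucket.

    With no host selected, each bucket shows the average of all hosts'
    mean values; with a host selected, only that host's values are used.
    """
    per_bucket = defaultdict(list)
    for row in rows:
        if selected_host is None or row["host"] == selected_host:
            per_bucket[row["bucket"]].append(row["mean_upload"])
    return {bucket: mean(values) for bucket, values in sorted(per_bucket.items())}


all_hosts = series_for_view(bucket_means)
only_a = series_for_view(bucket_means, selected_host="server-a")
print(all_hosts)
print(only_a)
```

The unfiltered series blends both hosts together, which is why it can look like a single time series even though each host has its own model.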
Once you have several partitions it can be useful to start off looking at the Anomaly Explorer. In this view you'll see swim lanes of the anomalies for each host, and there is a link which lets you drill down from there to see anomalies for individual hosts in the Single Metric Viewer.
So, from the drop down menu that I highlighted, should I see multiple hostnames (or whatever partition field I set) that I can then choose from?
I have tried with multiple different splits (partitions), host.hostname, agent.name, and they always show two graphs during set up, see below as an example;
But then when the job is actually created, from the drop down menu I only ever see one, not two.
From what you're saying this doesn't seem right?
I think the most likely reason for this is that there are no anomalies associated with one of the partitions, i.e. it either has very little data associated with it or its values are sufficiently stable that we haven't detected any anomalies for it. The logic that populates the values shown in that dropdown checks the partition field values present in (a subset of) the anomaly records which the job generated. You should still be able to manually enter the host name and see its data in this view, however.
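If you want to check this yourself, one option is to pull the job's anomaly records and list which partition values actually appear in them. The sketch below works on a response shaped like the get records API output, where each record carries `partition_field_value` and `record_score`; the sample records and the helper name `hosts_with_anomalies` are hypothetical, and this is only an approximation of what the dropdown does, not its actual implementation.

```python
def hosts_with_anomalies(records, min_score=0.0):
    """Return the sorted distinct partition values found in anomaly records."""
    return sorted(
        {
            record["partition_field_value"]
            for record in records
            if "partition_field_value" in record
            and record.get("record_score", 0.0) >= min_score
        }
    )


# Hypothetical records: only one of the two hosts has produced anomalies so far.
sample_records = [
    {"partition_field_name": "host.hostname.keyword",
     "partition_field_value": "server-a", "record_score": 12.5},
    {"partition_field_name": "host.hostname.keyword",
     "partition_field_value": "server-a", "record_score": 48.1},
]

found = hosts_with_anomalies(sample_records)
print(found)
```

A host that is being modelled but has no anomaly records yet would be missing from this list, which matches a dropdown that shows only one hostname.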
Thanks Tom. I'll let it run for 24 hours and then purposely create some anomalies by limiting the speed on the servers we're testing with and see if that makes a difference. Will let you know how we go.
Just a quick update to let you know that, having left it a few days, the drop down menu did indeed populate properly with both hosts, so it looks like it was just a lack of data and/or anomalies.
Thanks for your assistance here @BenTrent and @Tom_Veasey