ML alert for non-reporting servers and which server

machine-learning

(Ryan Downey) #1

I'm trying to set up an ML job that tracks the servers we have deployed Metricbeat/Packetbeat to and alerts us when one stops reporting. From what I understand, many of the ML functions work on numbers rather than strings. In our case we're looking for the total number of beat.name values (the Number of Hosts [Metricbeat System] visualization), so using a count aggregation and then setting the field to beat.name doesn't work, since the server names are strings, although I could very easily be misunderstanding this. In the Number of Hosts [Metricbeat System] visualization, beat.name is aggregated down to a single number using what appears to be the cardinality aggregation. Overall, we want to be alerted that one of our servers has stopped reporting, and which one it is. That is the end goal, and any help would be greatly appreciated. There are probably other ways to do this, so feel free to point me in another direction.

My thought process so far has led me down the road of potentially creating an array within the "field_name" value ("field_name" : "[]") that would produce a total number for us to work off of. ML could then use the low_count function, which should stay constant at 96 for us, to give us a heads up when the number decreases, and it won't alert us as we add more servers. Although this still won't help us figure out which server isn't working, from what I can tell.

PUT _xpack/ml/anomaly_detectors/metricbeat_monitoring
{
  "analysis_config": {
    "detectors": [{
      "function" : "low_count",
      "field_name" : "[cardinality: { field: beat.name }]" // This doesn't work, obviously,
                                                           // but something like this???
    }]
  },
  "data_description": {
    "time_field": "timestamp",
    "time_format": "epoch_ms"
  }
}

I've also been looking at Datafeeds and trying to figure out if setting up anything like that would work. Maybe even using the Watch API?
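As a rough sketch of the Watcher idea (the index pattern, schedule, and threshold of 96 are my own placeholders), I was picturing something like this, though it would only tell us the count dropped, not which host disappeared:

```json
PUT _xpack/watcher/watch/host_count_check
{
  "trigger": { "schedule": { "interval": "15m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "metricbeat-*" ],
        "body": {
          "size": 0,
          "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
          "aggs": { "hosts": { "cardinality": { "field": "beat.name" } } }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.aggregations.hosts.value": { "lt": 96 } }
  },
  "actions": {
    "log_it": { "logging": { "text": "Reporting host count dropped below 96" } }
  }
}
```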

Below is the request JSON behind the Number of Hosts [Metricbeat System] visualization, from what I've found:
{
  "size": 0,
  "_source": {
    "excludes": []
  },
  "aggs": {
    "1": {
      "cardinality": {
        "field": "beat.name"
      }
    }
  },
  "version": true,
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    "@timestamp",
    "ceph.monitor_health.last_updated",
    "docker.container.created",
    "docker.healthcheck.event.end_date",
    "docker.healthcheck.event.start_date",
    "docker.image.created",
    "kubernetes.container.start_time",
    "kubernetes.event.metadata.timestamp.created",
    "kubernetes.node.start_time",
    "kubernetes.pod.start_time",
    "kubernetes.system.start_time",
    "mongodb.status.background_flushing.last_finished",
    "mongodb.status.local_time",
    "php_fpm.pool.start_time",
    "postgresql.activity.backend_start",
    "postgresql.activity.query_start",
    "postgresql.activity.state_change",
    "postgresql.activity.transaction_start",
    "postgresql.bgwriter.stats_reset",
    "postgresql.database.stats_reset",
    "system.process.cpu.start_time"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*",
            "analyze_wildcard": true,
            "default_field": "*"
          }
        },
        {
          "query_string": {
            "analyze_wildcard": true,
            "default_field": "*",
            "query": "*"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1537896549877,
              "lte": 1537897449877,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "filter": [],
      "should": [],
      "must_not": []
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}


(rich collier) #2

Hi Ryan,

You're sort of on the right track here. You indeed want to use the count or low_count function, but keep in mind that this counts the number of documents returned by the datafeed query in each bucket_span. In other words, you don't need to count a field; count automatically counts the documents returned by the query.

The other thing to understand is that a job can be split along a categorical field, such as beat.name. Therefore, no matter how many beat.names you have, the ML job will build a baseline model for each one.

So, you can accomplish this via a config like this:

...
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "count for every host",
        "function": "low_count",
        "partition_field_name": "beat.name",
        "detector_index": 0
      }
    ],
    "influencers": [
      "beat.name"
    ]
  }
...

Then, you're golden! Any time any beat.name reports less than expected volume, you'll get an anomaly!

p.s. this can also be accomplished using a multi-metric job in the UI
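For reference, a complete version of that job plus a matching datafeed might look something like this (the job id, index pattern, and bucket span are placeholders, so adjust to your environment):

```json
PUT _xpack/ml/anomaly_detectors/metricbeat_host_count
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "count for every host",
        "function": "low_count",
        "partition_field_name": "beat.name"
      }
    ],
    "influencers": [
      "beat.name"
    ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  }
}

PUT _xpack/ml/datafeeds/datafeed-metricbeat_host_count
{
  "job_id": "metricbeat_host_count",
  "indices": [ "metricbeat-*" ],
  "query": { "match_all": {} }
}
```

Once both exist, open the job and start the datafeed, and each beat.name gets its own baseline model.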


(Ryan Downey) #3

Rich,

This got us up and running. Appreciate you taking the time to help us out with this. Enjoy the rest of your day!

Ryan


(Mark Walkom) #4