I want to create a small script that queries ES and sends me an alert when it sees "anomalous" data.
For example, if the CPU load on a VM suddenly spikes, I'd like an alert. That's easy enough to do with a fixed threshold, but I'd rather check for a sudden change in load/usage. Some VMs will naturally have a high CPU load or RAM usage, while others will not.
I am digging into Elasticsearch Query DSL and the various aggregations to try and create my own script for this. Basically, run a query, check for a condition, and then send an alert, or not.
All my search results on this topic end up at proprietary solutions or ElastAlert. I have no budget for this, and my attempts at getting ElastAlert working were not successful, though I may revisit it once I understand how to search ES better.
Here are a few specific things I want to watch for:
A) If CPU load has gone up by more than 200% in the past 30 minutes.
B) If RAM usage has gone up by more than 200% in the past 30 minutes.
C) If the number of Apache requests has suddenly gone down by more than 50% in the past 60 minutes.
How would you go about watching for those things?
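To make those three conditions concrete, here is a rough Python sketch of the comparison I have in mind. It assumes I already have two averages for a metric: one for an earlier baseline window and one for the most recent window. The names percent_change and should_alert are my own placeholders, and I am reading "gone up by more than 200%" as "the increase exceeds 200% of the baseline" (for example, 1.0 to more than 3.0), which may not be the only reasonable reading.

```python
def percent_change(baseline, current):
    """Return the change from baseline to current as a percentage of baseline."""
    if baseline is None or current is None or baseline == 0:
        # No usable baseline (no data, or a zero average), so no meaningful percentage.
        return None
    return (current - baseline) / baseline * 100.0


def should_alert(baseline, current, threshold_pct):
    """Positive threshold: alert on a rise above it. Negative threshold: alert on a drop below it."""
    change = percent_change(baseline, current)
    if change is None:
        return False
    if threshold_pct >= 0:
        return change > threshold_pct
    return change < threshold_pct


# A/B: load or RAM up by more than 200%; C: Apache requests down by more than 50%.
print(should_alert(1.0, 3.5, 200))    # rise of 250%
print(should_alert(1000, 400, -50))   # drop of 60%
```

Because the comparison is relative to each host's own baseline, a VM that always runs hot would not trigger it just for being busy.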
In my research, I discovered the Median Absolute Deviation aggregation. Would watching that be a way to get close to what I'm after?
I was able to build this query:
GET /_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "match_phrase": {
            "host.name": "learnescentos7"
          }
        },
        {
          "match_phrase": {
            "agent.type": "metricbeat"
          }
        },
        {
          "match_phrase": {
            "metricset.name": "load"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-30m",
              "lte": "now",
              "time_zone": "America/Los_Angeles"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  },
  "aggs": {
    "avg_load_1": { "avg": { "field": "system.load.1" }},
    "max_load_1": { "max": { "field": "system.load.1" }},
    "min_load_1": { "min": { "field": "system.load.1" }},
    "variability_1": { "median_absolute_deviation": { "field": "system.load.1" }},
    "avg_load_5": { "avg": { "field": "system.load.5" }},
    "max_load_5": { "max": { "field": "system.load.5" }},
    "min_load_5": { "min": { "field": "system.load.5" }},
    "variability_5": { "median_absolute_deviation": { "field": "system.load.5" }},
    "avg_load_15": { "avg": { "field": "system.load.15" }},
    "max_load_15": { "max": { "field": "system.load.15" }},
    "min_load_15": { "min": { "field": "system.load.15" }},
    "variability_15": { "median_absolute_deviation": { "field": "system.load.15" }}
  }
}
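One variation I have been considering is to ask for two windows in a single request, so the script can compare the latest 30 minutes against the 30 minutes before that. The sketch below only builds the request body as a Python dict and reads the averages out of a response. The helper names build_window_query and extract_averages and the aggregation names "previous", "recent", and "avg_value" are my own inventions. It relies on the filter aggregation with a range query inside it, and I have not tested it against a live cluster.

```python
def build_window_query(host, metricset, field, window_minutes=30):
    """Build a search body that averages `field` over two adjacent time windows."""
    recent_start = f"now-{window_minutes}m"
    previous_start = f"now-{2 * window_minutes}m"
    return {
        "size": 0,  # only the aggregations are needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"match_phrase": {"host.name": host}},
                    {"match_phrase": {"agent.type": "metricbeat"}},
                    {"match_phrase": {"metricset.name": metricset}},
                    {"range": {"@timestamp": {"gte": previous_start, "lte": "now"}}},
                ]
            }
        },
        "aggs": {
            "previous": {
                "filter": {"range": {"@timestamp": {"gte": previous_start, "lt": recent_start}}},
                "aggs": {"avg_value": {"avg": {"field": field}}},
            },
            "recent": {
                "filter": {"range": {"@timestamp": {"gte": recent_start, "lte": "now"}}},
                "aggs": {"avg_value": {"avg": {"field": field}}},
            },
        },
    }


def extract_averages(response):
    """Return (previous_avg, recent_avg); either is None if its window had no documents."""
    aggs = response["aggregations"]
    return aggs["previous"]["avg_value"]["value"], aggs["recent"]["avg_value"]["value"]


body = build_window_query("learnescentos7", "load", "system.load.1")
```

The body could then be sent with whatever HTTP or Elasticsearch client the script uses, and the two averages compared to decide whether to alert.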
Would alerting on variability_1 being larger than 0 get me anywhere close to knowing if CPU load has gone up? I suspect I'm off track here, since that aggregation will reflect load going down just as much as load going up.
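To check that suspicion, I wrote a small Python sketch that computes an exact median absolute deviation on made-up load samples. As far as I know, the ES aggregation approximates this value rather than computing it exactly, so the numbers will not match ES precisely. The sample values are invented for illustration only.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute distances between each value and the median of all values."""
    center = median(values)
    return median([abs(v - center) for v in values])


rising = [0.5, 0.6, 0.7, 2.0, 3.0]      # load climbing over the window
falling = list(reversed(rising))        # the same samples, load dropping instead
steady = [1.0, 1.1, 0.9, 1.0, 1.2]      # ordinary jitter, no real change

# Order does not matter, so a rise and a fall give the same deviation.
print(median_absolute_deviation(rising) == median_absolute_deviation(falling))
# Even normal jitter produces a value above zero.
print(median_absolute_deviation(steady) > 0)
```

If this is right, variability_1 being larger than 0 would be true for almost any window with real traffic, and it would not say which direction the load moved.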
Any advice would be welcome. Thanks!