Hi,
We use alarms to get notified about missing (or new) machines onboarding to our ELK. To do this, we query the docs containing heartbeats for the last 15 min and check whether every machine has at least 12 beats. This should safely indicate whether a machine is up or not. But since I must tolerate 3 min of lost beats, it is not possible to react to an outage any faster. Furthermore, I get false alarms when I power up a machine, because a freshly started machine has not yet collected 12 beats in the window.
Currently we use this alarm query:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "mqput_time": {
              "from": "{{period_end}}||-15m",
              "to": "{{period_end}}||-0h",
              "include_lower": true,
              "include_upper": true,
              "format": "epoch_millis",
              "boost": 1
            }
          }
        },
        {
          "term": {
            "Payload.IoT.First.Name.keyword": {
              "value": "Heartbeat",
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "Datenquelle": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "LocationCaption": {
              "terms": {
                "field": "Payload.Location.Caption.keyword",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          },
          {
            "IoTFirstName": {
              "terms": {
                "field": "Payload.IoT.First.Name.keyword",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      }
    }
  }
}
As a result I get this:
{
  "_shards": {
    "total": 10,
    "failed": 0,
    "successful": 10,
    "skipped": 9
  },
  "hits": {
    "hits": [],
    "total": {
      "value": 41,
      "relation": "eq"
    },
    "max_score": null
  },
  "took": 1010,
  "timed_out": false,
  "aggregations": {
    "Datenquelle": {
      "buckets": [
        {
          "doc_count": 13,
          "key": {
            "IoTFirstName": "Heartbeat",
            "LocationCaption": "Demo"
          }
        },
        {
          "doc_count": 14,
          "key": {
            "IoTFirstName": "Heartbeat",
            "LocationCaption": "Loc1"
          }
        },
        {
          "doc_count": 14,
          "key": {
            "IoTFirstName": "Heartbeat",
            "LocationCaption": "Loc2"
          }
        }
      ],
      "after_key": {
        "IoTFirstName": "Heartbeat",
        "LocationCaption": "Loc3"
      }
    }
  }
}
We put this into a mail using this template:
Alarm on machine!
reason:
Monitor {{ctx.monitor.name}} @ {{ctx.trigger.name}}
sources:
{{#ctx.results.0.result}}
• {{LocationCaption}} with {{value}} of {{ref}} required beats
{{/ctx.results.0.result}}
Link to dashboard:
{{#ctx.results.0.result}}
• {{LocationCaption}}:
{{{url}}}
{{/ctx.results.0.result}}
This works fine so far, but currently I can only define an alarm based on doc_count falling below a certain amount. Assuming a station posts 1 heartbeat per minute, there should be at least 12 in the 15 min window if it is still running.
This is the trigger script that selects the data from the query results (for creating the mail above):
ctx.results[0].result = [];
for (bucket in ctx.results[0].aggregations.Datenquelle.buckets) {
  if (bucket.key.IoTFirstName == "Heartbeat") {
    bucket.key.ref = 12; // threshold: required number of heartbeats
    bucket.key.url = "myULR)";
    bucket.key.value = bucket.doc_count;
    if (bucket.doc_count < bucket.key.ref) {
      // attach to result array
      ctx.results[0].result.add(bucket.key);
    }
  }
}
// fire the trigger only if at least one machine is below the threshold
return ctx.results[0].result.length > 0;
In a dashboard I would rather use Top_Hit than "RED=Count<13". So I think it would be better to have something like:
For identification of "died" machines ("no beat within the last 5 min"):
METACODE: IF Date(Now)-5min > Top_HIT(msg.timestamp) THEN ... ADD TO RESULT
For identification of "new" machines ("oldest beat in scope younger than 5 min"):
METACODE: IF Date(Now)-5min < Last_Hit(msg.timestamp) THEN ... ADD TO RESULT
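One direction I could imagine is sketched below. It is untested and only a rough idea, assuming that sub-aggregations are allowed under the composite aggregation in the version in use and that mqput_time is mapped as a date. The names last_beat, first_beat, recent_beats and older_beats are made up for illustration, and the 5 min cutoff is the value from the metacode above. last_beat and first_beat would return the newest and oldest mqput_time per bucket, while recent_beats and older_beats count the beats on either side of the cutoff. The fragment would replace the "aggregations" part of the query:

```json
{
  "aggregations": {
    "Datenquelle": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "LocationCaption": {
              "terms": {
                "field": "Payload.Location.Caption.keyword"
              }
            }
          },
          {
            "IoTFirstName": {
              "terms": {
                "field": "Payload.IoT.First.Name.keyword"
              }
            }
          }
        ]
      },
      "aggregations": {
        "last_beat": {
          "max": { "field": "mqput_time" }
        },
        "first_beat": {
          "min": { "field": "mqput_time" }
        },
        "recent_beats": {
          "filter": {
            "range": {
              "mqput_time": {
                "gte": "{{period_end}}||-5m",
                "format": "epoch_millis"
              }
            }
          }
        },
        "older_beats": {
          "filter": {
            "range": {
              "mqput_time": {
                "lt": "{{period_end}}||-5m",
                "format": "epoch_millis"
              }
            }
          }
        }
      }
    }
  }
}
```

In the trigger script, a bucket with recent_beats.doc_count == 0 would then be a candidate for a "died" machine, and a bucket with older_beats.doc_count == 0 a candidate for a "new" machine. A machine without any beat in the whole 15 min window produces no bucket at all, so it would not show up in either check.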
It would be fantastic if you could help me create such a query. I googled a lot, but could not find a matching example so far.
Thanks!