Alert on no data per source?

Here @leandrojmp suggested a Log Threshold alert to get notified when no logs are added in the last 5 minutes. This works, but only if you add the specific source as a WITH condition. In my case I want to be notified for, say, 6 sources (but not 2 other stale sources in the index history). It doesn't seem suitable for this case, at least in the UI, since you can only have WITH...IS...AND, whereas something like WITH source IN [a, b, c] and GROUP BY source would be more suitable.

Is there any alternative besides manually creating 1 alert per source? The closest I've gotten is Query DSL with min_doc_count 0, which displays the aggs as expected in Dev Tools (e.g. a: 0, b: 16, c: 44), but it either doesn't work in the alert UI or I'm not setting it up right (I tried WHEN sum/count grouped over source is below 1).
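For reference, this is roughly what that Dev Tools aggregation looks like when expressed through the Python client. It's only a sketch: the logs-* index pattern, the source keyword field, and the connection details are placeholders.

```python
# Sketch of the zero-bucket aggregation (placeholder index/field names).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="logs-*",
    size=0,
    query={"range": {"@timestamp": {"gte": "now-5m"}}},
    aggs={
        "per_source": {
            "terms": {
                "field": "source",
                # Return buckets even for terms with no matching documents,
                # so a silent source shows up with doc_count 0.
                "min_doc_count": 0,
            }
        }
    },
)

for bucket in resp["aggregations"]["per_source"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # e.g. a 0 / b 16 / c 44
```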

Hi @Mike8, what version are you on?

And when you say by source...
Does that mean source type, like nginx logs?
Or does source mean by host?

Perhaps clarify a bit...

On newer versions there is:

Alerts -> Observability Alerts -> Custom Threshold Alerts

which will alert per "Group By", including when there is no data.
So this says alert when above X or when there is no data, by host.name.

Version is 8.17.0. I hadn't tried the Custom threshold type before. Source is basically the cluster name. When I left the defaults, it triggered, but that was triggering for the log count being too high.

Instead I tried:
COUNT all documents

EQUATION A
IS BELOW 1
FOR THE LAST 5 minutes

Group alerts by: source/cluster

I also checked "Alert me if there's no data"

One source has many logs; one has had no logs for weeks now. But alerts only seem to trigger if I change IS BELOW to IS ABOVE.

It seems like many of these alert types don't consider buckets if doc_count is 0. I tried to limit it by adding a query filter like source: "live-cluster" OR source: "dead-cluster", but it didn't help. (Ultimately I would want to exclude the dead cluster from consideration, but for now it's useful for trying to trigger the alert.)
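To illustrate the bucket behaviour outside the alert UI (again just a sketch with placeholder index/field names): with the default min_doc_count of 1, a source with no documents in the window produces no bucket at all, so there is no group for an "IS BELOW 1" condition to evaluate.

```python
# Sketch: default terms aggregation over the 5-minute window
# (placeholder index/field names).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="logs-*",
    size=0,
    query={
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-5m"}}},
                {"terms": {"source": ["live-cluster", "dead-cluster"]}},
            ]
        }
    },
    # min_doc_count defaults to 1, so empty buckets are dropped.
    aggs={"per_source": {"terms": {"field": "source"}}},
)

# Only "live-cluster" comes back; "dead-cluster" has no bucket to be "below 1".
print([b["key"] for b in resp["aggregations"]["per_source"]["buckets"]])
```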

One caveat: it has to get logs from a source before that source can be considered missing.

That's the only way it can know something is missing...

If you have 10 sources that are logging and two of them stop sending logs, those will be alerted on.

If you add an 11th source that has never logged, there's no way to know it's missing data.

I've tested these cases and they work.

I will tell you that using IS BELOW 1 is not a good approach... for the exact reason you describe: when a bucket is zero, there's nothing to report on...

That's why you want to use "Alert me if there's no data", because that way we're keeping track.

When you say logging, do you mean based on the alert? Because old logs are still accessible via the explorer when filtering for source: "dead-cluster". I'm on the fence between this and multiple Log Threshold alerts then; at least with the Log Threshold I can test it, be confident it works, and see how the email looks.

Hi @Mike8

Apologies, I should have been more clear

If you have three hosts and they are sending telemetry to Elasticsearch (in this case I was referring to, say, application or system logs, but it could be anything, e.g. metric data)...

And you set up an alert like I showed before that's looking for a number of error messages greater than 100 over the last 5 minutes...

If one of those hosts stops sending the telemetry over the 5-minute alert window, you'll get an alert that there is no data. That's what the little checkbox means.

Clearly there's still historical telemetry in Elasticsearch. The "Alert me if there's no data" option is there to help you understand when a host, container, etc. stops sending the telemetry that you're trying to alert on.

These are all pretty easy to test. Set up a single host... have it send telemetry... set up an alert like I showed you... stop sending telemetry and you'll get an alert.

Hopefully that makes sense

The group by is very powerful. I would suggest looking at that

But in the end if you want to set up individual alerts for every host or whatever you want to partition by, that's up to you

Hopefully this makes sense

I will say the method I showed you is a pretty popular way to do what we're discussing, because a single alert can be across many hosts, etc.

If you need different threshold / condition per host then you will need to create separate alerts.

I specifically broke the otel config on a cluster to test it:

This view is with query filter: source:"cluster-a" OR source:"cluster-b"
COUNT all documents
EQUATION A IS BELOW 1
FOR THE LAST 5 minutes
Group alerts by: source
"Alert me if there's no data" is checked

In the dashboard you can clearly see that the 2nd cluster stopped sending data but no email was sent regarding that cluster no longer sending data.

However, my related Log Threshold alert did trigger an email. It seems that "Alert me if there's no data" only takes effect if every single cluster stops sending data for the 5-minute period, not a single cluster/bucket. Again, presumably because the UI lacks a checkbox like "consider empty buckets / doc_count = 0"; instead it just sees a general "hits = 50, so there's data".

Hi @Mike8

And you ran the alert... not just tested it?

Hmm, not my experience (nor the intention, nor what the documentation states)... If I get a chance I will test again.

  • Has "group alerts by" fields: If a previously detected group stops reporting data, a "no data" alert is triggered for the missing group. For example, consider a scenario where host.name is the group alerts by field for CPU usage above 80%. The first time the rule runs, two hosts report data: host-1 and host-2. The second time the rule runs, host-1 does not report any data, so a "no data" alert is triggered for host-1. When the rule runs again, if host-1 starts reporting data again, there are a couple possible scenarios:

    • If host-1 reports data for CPU usage and it is above the threshold of 80%, no new alert is triggered. Instead the existing alert changes from "no data" to a triggered alert that breaches the threshold. Keep in mind that no notifications are sent in this case because there is still an ongoing issue.
    • If host-1 reports CPU usage below the threshold of 80%, the alert status is changed to recovered.

To clarify after looking more: the alert shows as Active in the UI but no email was sent. I assume the default setting is OK.

Yesterday I was thinking maybe there was a strange reason, like a lingering status, that prevented it, but it seems not. I noticed it still showed Active because the query was previously wrong (changing the query did not invalidate the previous active alerts), although I would hope that one cluster being active would not block the email alert of another when using group by.

I changed it to Untracked so the alert would be Recovered. But now I've triggered the alert from Recovered to Active and still no email from the Custom threshold (only Log Threshold) :confused:

Apologies, I'm having a little trouble following; did you see this?

If you want another email, you have to add another action.

Ah ok, thanks. It finally triggered an email when changing to "No Data". I also broke a 2nd cluster and it also triggered an email. But it's concerning that, when viewing the alert details, the older broken cluster somehow had its status become "Recovered" although no logs have been sent from it for the last 2 hours. Basically, I can't use the Active/Recovered view as a trustworthy source.

I had similar issues in the past where an alert was marked as recovered not because data started flowing again, but because the group it was tracking had no more data in the specific interval.

I also had issues with alerts triggering too late or not triggering at all.

To be honest, in my experience with Elasticsearch, alerting on no data is something that is unnecessarily complicated to do.

We decided to write a custom Python script to trigger some ES|QL queries and feed this information back into Elasticsearch to be able to create more reliable alerts.
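For anyone interested, here is a minimal sketch of that kind of approach. It is not our exact script: the index names, the source field, the list of expected sources, and the connection details are placeholders, and it assumes a cluster/client version that exposes the ES|QL /_query endpoint.

```python
# Sketch: count documents per source over the last 5 minutes via ES|QL
# and write the result (including explicit zeros) back into a small index
# that a simpler, more reliable alert can watch. Placeholder names throughout.
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.esql.query(
    query="""
        FROM logs-*
        | WHERE @timestamp > NOW() - 5 minutes
        | STATS doc_count = COUNT(*) BY source
    """
)

# Map source -> doc_count using the column names returned by ES|QL.
cols = [c["name"] for c in resp["columns"]]
seen = {
    row[cols.index("source")]: row[cols.index("doc_count")]
    for row in resp["values"]
}

# Sources we expect to be alive; anything missing gets an explicit 0,
# so the downstream alert only has to check for doc_count == 0.
expected_sources = ["cluster-a", "cluster-b", "cluster-c"]

now = datetime.now(timezone.utc).isoformat()
for source in expected_sources:
    es.index(
        index="source-liveness",
        document={
            "@timestamp": now,
            "source": source,
            "doc_count": seen.get(source, 0),
        },
    )
```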

I think I can easily replicate some of the issues I had in order to open a GitHub issue, but I need to find time to do that.

The false positive recovery would be the easiest one to replicate.

Trying again, I wasn't able to repro the status marking itself Recovered unexpectedly, but the fact that it happened once concerns me enough that I may just do one Log Threshold alert per cluster. That seems to be reliable, except that I don't love having 6-10 alerts identical apart from the cluster name :sweat_smile: It also reads more logically than Custom threshold, where the actual condition (EQUATION IS BELOW 1) does not work at all... someone in the future might see that and think "Alert me if there's no data" is not necessary, and suddenly there would be no working alerts. Whereas with Log Threshold, the actual query is used to decide whether to trigger the alert.