I am trying to set up an alert rule that notifies me when a job that I expect to run on particular servers stops writing to the syslog. I want to receive an alert when:
Any host with a field server_type: "gitlab-runner"
Does not have a log message containing the phrase Total reclaimed space
In the last 8 hours
I set up a Log threshold rule with the following configuration:
WHEN THE count OF LOG ENTRIES
WITH message MATCHES PHRASE "Total reclaimed space"
AND server_type IS gitlab-runner
IS less than 1
FOR THE LAST 8 hours
GROUP BY host.name
(check every 2 hours)
The problem with this alert rule is that it evaluates every host that sends logs. Most of the hosts do not have Total reclaimed space in their logs (and they are not expected to), so they trigger alerts. That is not what I want. I want a way to tell Kibana "before doing anything else, ignore hosts not annotated with server_type: gitlab-runner". How can I do that?
I tried doing the GROUP BY you suggest, but the problem remains. The problem is that, when you say "you will only get from the hosts that have server_type", all my hosts have server_type defined. So this will still match everything. What I want is to only match hosts that have a particular value of server_type.
I saw the warning about not using Less than in a grouping query, but I could not think of an alternative. I am very much open to other approaches. All I need is an alerting rule that checks that a particular string appears in the logs on a regular basis.
So here is my test. It works, and it only matches kubernetes.labels.app IS productcatalogservice.
BUT this is a "more than" case:
LOG VIEW Default
WHEN THE count OF LOG ENTRIES
WITH message MATCHES via_upstream
AND kubernetes.labels.app IS productcatalogservice
IS more than 2000
FOR THE LAST 5 minutes
GROUP BY kubernetes.labels.app, host.name
BUT when I tried it with a "less than" case, I got alerts for all the kubernetes.labels.app values, so you are right. I think that is an artifact of the "less than" condition, and the more I look at it, the more I get why:
LOG VIEW Default
WHEN THE count OF LOG ENTRIES
WITH message MATCHES via_upstream
AND kubernetes.labels.app IS productcatalogservice
IS less than 10000
FOR THE LAST 5 minutes
GROUP BY kubernetes.labels.app, host.name
The condition is in fact met for every pod and host: there are 0 matching entries for all the other app/host combinations, so all of them are reported.
With "more than", those conditions are not met.
So, all that said... yup, I need to think about that.
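For anyone following along, here is a toy Python model of the behavior described above. It is only a sketch of my understanding, not how Kibana actually evaluates the rule: it assumes the groups are built from every host that logs anything, while the count only includes entries that pass the filters. The log entries and the `hosts_below_threshold` helper are made up for illustration.

```python
from collections import Counter

# Toy log entries: (host.name, server_type, message). All values are invented.
logs = [
    ("runner-1", "gitlab-runner", "Total reclaimed space: 1.2GB"),
    ("runner-2", "gitlab-runner", "job started"),
    ("web-1", "webserver", "GET /index.html"),
]


def hosts_below_threshold(entries, phrase, server_type, threshold):
    """Model a grouped 'less than' rule.

    Assumption: groups come from ALL entries, but only entries that
    match the filters are counted, so unrelated hosts end up with 0.
    """
    groups = {host for host, _, _ in entries}
    matches = Counter(
        host
        for host, stype, msg in entries
        if stype == server_type and phrase in msg
    )
    # A Counter returns 0 for missing keys, which is exactly the problem:
    # 0 is "less than 1" for every host that never matched the filters.
    return sorted(host for host in groups if matches[host] < threshold)


alerting = hosts_below_threshold(logs, "Total reclaimed space", "gitlab-runner", 1)
print(alerting)  # ['runner-2', 'web-1']
```

web-1 is not a gitlab-runner at all, yet it is reported because its filtered count is 0. With a "more than" threshold the same zero counts never satisfy the condition, which matches what the two tests showed.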
So in the end, if you are really just trying to figure out when a particular service last wrote a log, I would perhaps use a latest transform and then a simple alert on top of that.
Latest transforms are a pretty awesome way of keeping track of the "last event" from logs, services, hosts, etc. Give it a look.
I have used this for many use cases. It works really well!
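As a rough illustration of the latest transform idea applied to the original question, the Python sketch below builds a request body for the create transform API (`PUT _transform/<transform_id>`) and then shows the staleness check an alert on the destination index would perform. The index names `logs-*` and `last-reclaim-by-host`, the `5m` frequency, and the `stale_hosts` helper are assumptions for the example, and the `term` filter assumes `server_type` is mapped as a keyword field.

```python
import json
from datetime import datetime, timedelta, timezone

# Request body for PUT _transform/<transform_id>.
# Index names are hypothetical; adjust them to your own setup.
transform_body = {
    "source": {
        "index": "logs-*",
        "query": {
            "bool": {
                "filter": [
                    # Assumes server_type is a keyword field.
                    {"term": {"server_type": "gitlab-runner"}},
                    {"match_phrase": {"message": "Total reclaimed space"}},
                ]
            }
        },
    },
    "dest": {"index": "last-reclaim-by-host"},
    # Keep only the newest matching document per host.
    "latest": {"unique_key": ["host.name"], "sort": "@timestamp"},
    # Run continuously so the destination index stays current.
    "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
    "frequency": "5m",
}
print(json.dumps(transform_body, indent=2))


def stale_hosts(latest_docs, now, max_age=timedelta(hours=8)):
    """Return hosts whose newest matching log is older than max_age.

    latest_docs is a simplified stand-in for the destination index:
    one dict per host with 'host' and 'timestamp' keys.
    """
    return sorted(
        doc["host"] for doc in latest_docs if now - doc["timestamp"] > max_age
    )


# Invented sample data standing in for the destination index contents.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
docs = [
    {"host": "runner-1", "timestamp": datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)},
    {"host": "runner-2", "timestamp": datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)},
]
print(stale_hosts(docs, now))  # ['runner-2']
```

The alert on top then only needs to look for documents in the destination index whose `@timestamp` is older than 8 hours, which is a "more than" style condition and so avoids the zero-count problem. One caveat with this sketch: a host that has never written the phrase has no document in the destination index, so it would not be reported.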
There may also be another way to do this with a DSL query.
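One possible shape for such a DSL query, sketched in Python: filter to gitlab-runner hosts first, bucket by `host.name`, and count the phrase matches per host with a `filter` sub-aggregation. This is only a guess at what a DSL approach could look like, not a tested rule configuration; the bucket size of 500, the sample response, and the `silent_hosts` helper are assumptions, and how to wire the result into an alerting rule is left open.

```python
import json

# Search body: restrict to gitlab-runner hosts BEFORE grouping, so other
# server types never become buckets. Assumes server_type and host.name
# are keyword fields.
query_body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"server_type": "gitlab-runner"}},
                {"range": {"@timestamp": {"gte": "now-8h"}}},
            ]
        }
    },
    "aggs": {
        "hosts": {
            "terms": {"field": "host.name", "size": 500},
            "aggs": {
                # Per-host count of log lines containing the phrase.
                "reclaim": {
                    "filter": {"match_phrase": {"message": "Total reclaimed space"}}
                }
            },
        }
    },
}
print(json.dumps(query_body, indent=2))


def silent_hosts(response):
    """Return hosts whose 'reclaim' sub-aggregation matched no documents."""
    buckets = response["aggregations"]["hosts"]["buckets"]
    return sorted(b["key"] for b in buckets if b["reclaim"]["doc_count"] == 0)


# Hand-written sample response in the shape a terms + filter aggregation returns.
sample_response = {
    "aggregations": {
        "hosts": {
            "buckets": [
                {"key": "runner-1", "doc_count": 120, "reclaim": {"doc_count": 3}},
                {"key": "runner-2", "doc_count": 95, "reclaim": {"doc_count": 0}},
            ]
        }
    }
}
print(silent_hosts(sample_response))  # ['runner-2']
```

Note that this only sees hosts that logged something in the last 8 hours; a runner that went completely silent would not show up as a bucket.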
Yeah, I ran into the same issue and reached the same conclusion: it is technically correct that all the hosts are matched, because the condition applies to all of them.
Huge thank you for pointing me in the direction of the Transform feature though. I have not used it before and it's exactly what I need! That solves my initial question.