Hi Guys,
I have a scenario-based question on alerting: how does ES decide that an alert is "recovered"? What is the core logic?
I am setting up Airflow DAG failure alerts. I receive a document containing the fields DAG: xyz and State: Failed, and I created a search alert rule that looks for any document containing "State: Failed", across all DAGs, in the last 30 minutes. The check runs every 15 minutes.
The condition is met and the alert goes into the ACTIVE state. But after 30 minutes (2 runs) the alert shows as "RECOVERED", even though the last run of my DAG is still failed. Since there has been no success or further failure, there is no new document for the search to find after 30 minutes, and I think that is why it was marked "RECOVERED". Is this the right way to mark an alert as recovered?
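To make the timing concrete, here is a minimal Python sketch of the behaviour described above. It is only a simulation, not Elasticsearch or Kibana code: the function names, the document shape (`dag`, `state`, `ts`) and the status labels are assumptions made for illustration. It assumes the rule is active whenever a Failed document falls inside the 30-minute window and is reported as recovered on the first run where nothing matches.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)    # look-back window of the rule
INTERVAL = timedelta(minutes=15)  # how often the rule runs

def rule_matches(docs, now, window=WINDOW):
    """True if any Failed document has a timestamp in (now - window, now]."""
    return any(
        d["state"] == "Failed" and now - window < d["ts"] <= now
        for d in docs
    )

def simulate(docs, start, runs):
    """Evaluate the rule `runs` times and return the status after each run."""
    statuses = []
    was_active = False
    for i in range(runs):
        now = start + i * INTERVAL
        matched = rule_matches(docs, now)
        if matched:
            statuses.append("ACTIVE")
        elif was_active:
            # Condition no longer met on this run: treated as recovered.
            statuses.append("RECOVERED")
        else:
            statuses.append("OK")
        was_active = matched
    return statuses

# Hypothetical data: one failure document, written a minute before the first run.
t0 = datetime(2024, 1, 1, 12, 0)
docs = [{"dag": "xyz", "state": "Failed", "ts": t0 - timedelta(minutes=1)}]
statuses = simulate(docs, t0, 4)
print(statuses)
```

With these inputs the failure document is still inside the window on the first two runs and has aged out of it by the third, which is the point where the simulated alert flips to recovered even though no success was ever recorded.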
The alert is marked as recovered because, on a later run, the rule condition is no longer met: once the failed document is older than 30 minutes, the search returns nothing. To keep the alert active until the DAG run succeeds, you need to change the alert condition. Instead of checking for "State: Failed" documents in the last 30 minutes, you could check for the absence of a "State: Success" document for the specific DAG in the last 30 minutes. That way, the alert stays active until the DAG has a successful run.
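The suggested condition can be sketched in plain Python as follows. Again this is only an illustration of the logic, not a rule definition: `dags_missing_success`, the document shape and the sample DAG names are hypothetical.

```python
from datetime import datetime, timedelta

def dags_missing_success(docs, dags, now, window=timedelta(minutes=30)):
    """Return the DAG names with no Success document in (now - window, now]."""
    succeeded = {
        d["dag"]
        for d in docs
        if d["state"] == "Success" and now - window < d["ts"] <= now
    }
    return sorted(set(dags) - succeeded)

# Hypothetical data: xyz failed, abc succeeded.
t0 = datetime(2024, 1, 1, 12, 0)
docs = [
    {"dag": "xyz", "state": "Failed", "ts": t0 - timedelta(minutes=1)},
    {"dag": "abc", "state": "Success", "ts": t0 + timedelta(minutes=20)},
]

# 30 minutes later xyz still has no success, so it keeps alerting.
before = dags_missing_success(docs, ["xyz", "abc"], t0 + timedelta(minutes=30))

# Once xyz reports a success, it drops out of the alerting set.
docs.append({"dag": "xyz", "state": "Success", "ts": t0 + timedelta(minutes=40)})
after = dags_missing_success(docs, ["xyz", "abc"], t0 + timedelta(minutes=45))

print(before, after)
```

One thing to keep in mind with this condition: a DAG that simply has not run inside the window also has no Success document, so the window probably needs to be at least as long as the DAG's schedule interval.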