I'm experiencing an issue with Kibana generating false Recovery Monitor UP documents for monitors that have been offline for a long time. Specifically, when a host monitored with Heartbeat is offline for an extended period, Kibana generates a recovery document even though the host has not come back online. This results in false notifications being sent out.
I'm wondering if anyone else has experienced this issue and if there are any potential solutions or workarounds. I've checked the logs and error messages in Kibana, but haven't found any clear indications of what might be causing the issue.
To help troubleshoot this issue further, I've attached the alerts documents from Kibana and the heartbeat documents from the time of the notifications to this post. If anyone has any suggestions or ideas on how to resolve this problem, I'd greatly appreciate your input.
Heartbeat and Kibana documents:
Thanks in advance for your help!
Stack Version is: 7.17-5
Considering the alert query, I think the problem might be that the threshold is cutting it too close to the monitor's schedule:
ANY MONITOR IS DOWN > 12 times WITHIN last 180 seconds
What Kibana is looking for in this case is not a monitor with "up" status, but rather whether any monitor has been "down" 12 times or fewer. Mind you, if your monitor does not even run more than twelve times in those 180 seconds, the alert will be considered resolved.
If your monitor is running close to twelve times in those 180 seconds, there might be cases where one of the executions has drifted a bit and got out of the alert window.
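To make the drift scenario concrete, here is a minimal Python sketch. It assumes the rule counts checks in a window that excludes the start boundary and includes the end; how Kibana actually treats the boundaries is not confirmed in this thread, and the function names are made up for illustration.

```python
def count_checks_in_window(check_times, window_end, window_seconds=180):
    """Count checks with window_end - window_seconds < t <= window_end.

    The half-open window is an assumption, not confirmed Kibana behavior.
    """
    window_start = window_end - window_seconds
    return sum(1 for t in check_times if window_start < t <= window_end)


def alert_stays_active(down_count, threshold):
    """Mirror 'ANY MONITOR IS DOWN > threshold times'."""
    return down_count > threshold


# A host that is down the whole time, checked every 15 seconds.
on_schedule = [15 * i for i in range(20)]
# Same schedule, but the check due at t=180 ran one second late.
drifted = [181 if t == 180 else t for t in on_schedule]

on_time_count = count_checks_in_window(on_schedule, window_end=180)  # 12
drifted_count = count_checks_in_window(drifted, window_end=180)      # 11

print(alert_stays_active(on_time_count, 12))  # False: 12 is not > 12
print(alert_stays_active(drifted_count, 11))  # False: 11 is not > 11
print(alert_stays_active(drifted_count, 10))  # True
```

Under these assumptions, a host that never came back up still drops out of the alert as soon as the window holds no more checks than the threshold.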
The first step would be to check whether you have more than 12 "down" statuses in the interval the alert reported as "recovered"; I can only see 8 in the screenshot. If you do have more than 12, then the "recovered" alert was fired incorrectly.
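One way to run that check is to count the "down" documents for the monitor over the interval in question. The sketch below only builds a request body that could be sent to the Elasticsearch `_count` API against the Heartbeat indices. The monitor ID and timestamps are placeholders, the helper name is hypothetical, and the range boundaries are an assumption about how the window is evaluated.

```python
import json


def build_down_count_query(monitor_id, start_iso, end_iso):
    """Build a _count request body for 'down' Heartbeat documents.

    Uses the standard Heartbeat fields monitor.id, monitor.status
    and @timestamp. The gt/lte boundaries are an assumption.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"monitor.id": monitor_id}},
                    {"term": {"monitor.status": "down"}},
                    {"range": {"@timestamp": {"gt": start_iso, "lte": end_iso}}},
                ]
            }
        }
    }


# Placeholder values: substitute your own monitor ID and the
# 180-second interval that ended when the recovery was reported.
body = build_down_count_query(
    "my-host-monitor",
    "2023-01-01T10:00:00Z",
    "2023-01-01T10:03:00Z",
)
print(json.dumps(body, indent=2))
```

If the returned count for that interval is above the threshold, the recovery was not justified by the data; if it is at or below, the rule behaved as configured.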
Thank you for your response. I appreciate your suggestion about the close scheduling, but I actually have many more documents than the 8 shown in the screenshot I provided. I apologize for not being more clear about that.
Regarding my question about how to configure the alert to notify me after 3 minutes of down status, even if new documents are generated every 15 seconds: do you have any suggestions on how I can achieve this? I want to avoid receiving false notifications due to documents being generated during the alert window.
Should I switch it from 12 to 11 in 180 seconds?
Thank you for your help!
Hi @Adriann ,
I think this approach is mostly correct, just cutting it a bit too close. This might be coming into play too.
Kibana would need 13 "down" statuses for the alert to remain active, instead of the intended 12. If your monitors run every 15 seconds, the 180-second window will not always contain that many.
This should not trigger false recoveries as often:
ANY MONITOR IS DOWN > 11 times WITHIN last 180 seconds
And if you still see many recovered events, you can try tuning it down to 10.
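The arithmetic behind picking 11 or 10 can be sketched in a few lines of Python. This is only an illustration of the reasoning in this thread, not something Kibana provides: the helper names are hypothetical, and `late_checks_tolerated` is an invented knob for how many checks may slip out of the window.

```python
def max_checks_in_window(window_seconds, interval_seconds):
    """Upper bound on evenly spaced checks that fit in the window."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return window_seconds // interval_seconds


def suggested_threshold(window_seconds, interval_seconds, late_checks_tolerated=0):
    """Largest N for which 'DOWN > N times' can still hold.

    '> N' needs N + 1 down checks, so N + 1 must not exceed the number
    of checks that reliably land in the window.
    """
    reliable = max_checks_in_window(window_seconds, interval_seconds) - late_checks_tolerated
    if reliable < 1:
        raise ValueError("no checks reliably fall inside the window")
    return reliable - 1


print(max_checks_in_window(180, 15))                          # 12
print(suggested_threshold(180, 15))                           # 11
print(suggested_threshold(180, 15, late_checks_tolerated=1))  # 10
```

With a 15-second schedule and a 180-second window, this gives 11 when every check lands on time and 10 when one late check should be tolerated, which matches the two values suggested above.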
Thank you for the clarification. I believe I was testing many edge cases, but it was a year and a half ago, so I don't remember much from my conclusions then. I have updated the alert query to reflect your suggestion:
**ANY MONITOR IS DOWN > 11 times WITHIN the last 180 seconds**