Mean Time Between Failure Heartbeat documents

Hi,
I'm trying to calculate the mean time between recoveries (MTBR) and the mean time between failures (MTBF) for some services monitored by Heartbeat. For example, for MTBR, for each service I would like to get the time elapsed between two successive documents that share the same monitor.id and have monitor.status down and up, respectively. How can I do that?

P.S. I can also do further offline processing once I have obtained the data.

This is actually kind of tricky. I've been working on doing this sort of calculation accurately in a branch: https://github.com/andrewvc/kibana/tree/timelines . You can track the work in this issue: https://github.com/elastic/uptime/issues/55 . It's on our roadmap. Once we have that underlying infrastructure, we can calculate things like MTBR accurately.

Thank you so much for the information. In the meantime, could you point me to a workaround, perhaps one that uses aggregations?

There's not really a great one that you can do in a single query. A prerequisite for timelines is including the frequency of the check (the interval between checks) with each message, which will let you calculate a somewhat accurate number for time down over a period (just the sum of the frequency for all down checks). The timelines PR is more accurate (it handles mis-scheduled checks) but requires a lot of complex processing in JS.
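
For the offline route mentioned in the question, here is a rough Python sketch of both ideas: summing the check interval over all down checks, and pairing the first down document of an outage with the next up document for the same monitor.id. It assumes the Heartbeat documents have already been exported and flattened into dicts keyed by "monitor.id", "monitor.status" and "@timestamp". The "check_interval_s" field and both function names are hypothetical; Heartbeat may not ship the interval with each document, so you may need to fill it in from your monitor configuration.

```python
from collections import defaultdict
from datetime import datetime


def parse_ts(value):
    # Heartbeat timestamps usually end in "Z"; older Pythons need "+00:00".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def estimated_downtime_seconds(docs):
    """Per monitor, sum the check interval of every 'down' document."""
    totals = defaultdict(float)
    for doc in docs:
        if doc["monitor.status"] == "down":
            # "check_interval_s" is a hypothetical field (seconds between checks).
            totals[doc["monitor.id"]] += doc["check_interval_s"]
    return dict(totals)


def recovery_durations_seconds(docs):
    """Per monitor, seconds from the first 'down' of an outage to the next 'up'."""
    by_monitor = defaultdict(list)
    for doc in docs:
        by_monitor[doc["monitor.id"]].append(doc)

    durations = defaultdict(list)
    for monitor_id, monitor_docs in by_monitor.items():
        monitor_docs.sort(key=lambda d: parse_ts(d["@timestamp"]))
        down_since = None  # start of the current outage, if any
        for doc in monitor_docs:
            ts = parse_ts(doc["@timestamp"])
            if doc["monitor.status"] == "down":
                if down_since is None:
                    down_since = ts
            elif down_since is not None:
                durations[monitor_id].append((ts - down_since).total_seconds())
                down_since = None
    return dict(durations)


# Made-up sample data: svc-a is down for two consecutive checks, svc-b never fails.
docs = [
    {"monitor.id": "svc-a", "monitor.status": "up", "@timestamp": "2019-01-01T00:00:00.000Z", "check_interval_s": 10},
    {"monitor.id": "svc-a", "monitor.status": "down", "@timestamp": "2019-01-01T00:00:10.000Z", "check_interval_s": 10},
    {"monitor.id": "svc-a", "monitor.status": "down", "@timestamp": "2019-01-01T00:00:20.000Z", "check_interval_s": 10},
    {"monitor.id": "svc-a", "monitor.status": "up", "@timestamp": "2019-01-01T00:00:30.000Z", "check_interval_s": 10},
    {"monitor.id": "svc-b", "monitor.status": "up", "@timestamp": "2019-01-01T00:00:05.000Z", "check_interval_s": 10},
]

print(estimated_downtime_seconds(docs))
print(recovery_durations_seconds(docs))
```

Averaging each monitor's list of durations (for example with statistics.mean) gives a per-service mean time to recover. Like the frequency sum, this is only as accurate as the check schedule: an outage that starts and ends between two checks is never seen.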
