Timelion heartbeat downtime count analysis

A Spark Streaming data ingestion service sends a heartbeat once per batch loop (in a field called ingestion_alive, with a value of 1). So, if the batch loop is 2 minutes, we expect to receive 30 (±3) heartbeat messages per hour.
What we would like to see is how many times the ingestion has failed and thus didn't send a heartbeat message. Let's say that during the last 24 hours, the ingestion failed 3 times. The first time it was offline for an hour, the second time for just 15 minutes, and the third time for 3 hours straight. The question is: how do I get the number of times the ingestion has failed?

I first thought I would sum the ingestion_alive field values per interval, then take a cumulative sum over it (as seen in the picture), then apply the derivative() function to the cumulative sum. The derivative would then have a slope of zero over the time periods when there was no heartbeat, and I could count the derivative points that are zero. However, applying the derivative() function to the output of the cusum() function just gives back the original time series, as if no functions had been applied. What am I doing wrong here?


It took me a while to realize, but the derivative of a cumulative sum is just the original series, with no aggregation applied.

The cumulative sum for bucket 3, for example, is: x1 + x2 + x3.
The cumulative sum for bucket 4 would be: x1 + x2 + x3 + x4.
The derivative for bucket 4 is the cumulative sum for bucket 4 minus the cumulative sum for bucket 3, which is (x1 + x2 + x3 + x4) - (x1 + x2 + x3) = x4.
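
To check the cancellation outside of Timelion, here is a minimal Python sketch. It assumes the derivative is a plain difference between consecutive buckets; the function names and the sample heartbeat counts are made up for illustration and are not Timelion's implementation.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Running total: bucket n holds x1 + x2 + ... + xn."""
    return list(accumulate(values))


def derivative(values):
    """Difference between each bucket and the previous one.

    The first bucket has no predecessor, so the result is one element shorter.
    """
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Hypothetical heartbeat counts per bucket, with two empty buckets in the middle.
heartbeats = [30, 29, 0, 0, 31, 30]

cusum = cumulative_sum(heartbeats)
recovered = derivative(cusum)

# Apart from the first bucket, the derivative of the cumulative sum
# is the original series again.
print(recovered == heartbeats[1:])
```

So the zero-slope stretches you were hoping to count are simply the buckets where the original sum is already zero.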

What I would do is plot the moving average of the count/sum and put a .static() line on the chart with 30 as the value. This way you can see where the series drops below the line. (Or set it to 27 to allow for the ±3 variation.)
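
If you need the number of failures as a value rather than reading it off the chart, one option is to export the per-bucket sums and count the runs that fall below the threshold. The following is only a sketch under assumptions: the count_outages helper, the hourly bucket size, and the sample data are hypothetical, and it compares the raw sums to the threshold instead of a moving average.

```python
def count_outages(counts, threshold=27):
    """Count separate runs of consecutive buckets below the threshold.

    Each run of one or more low buckets is treated as a single outage.
    """
    outages = 0
    in_outage = False
    for count in counts:
        below = count < threshold
        if below and not in_outage:
            outages += 1  # a new run of low buckets starts here
        in_outage = below
    return outages


# Hypothetical 24 hourly sums of ingestion_alive matching the example in the
# question: a 1-hour outage, a 15-minute outage, and a 3-hour outage.
hourly = [30] * 3 + [0] + [30] * 4 + [22] + [30] * 5 + [0, 0, 0] + [30] * 7

print(count_outages(hourly))
```

With hourly buckets, a short outage that straddles two buckets may barely dip below the threshold, so a smaller bucket size with a proportionally smaller threshold is probably more reliable.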
