A Spark Streaming data ingestion service sends a heartbeat once per batch loop (in a field called ingestion_alive, with a value of 1). So, if the batch loop is 2 minutes, we expect to receive 30 (±3) heartbeat messages per hour.
What we would like to see is how many times the ingestion has failed and therefore did not send heartbeat messages. Let's say that during the last 24 hours, the ingestion failed 3 times: the first time it was offline for an hour, the second time for just 15 minutes, and the third time for 3 hours straight. The question is: how do we get the number of times the ingestion has failed?
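To make the goal concrete, here is a minimal Python sketch (not Timelion) of the counting logic, assuming the heartbeats have already been summed into fixed-size time buckets. The function name count_outages and the sample data are hypothetical; an outage is counted once for each run of consecutive empty buckets, regardless of how long it lasts.

```python
from typing import Sequence


def count_outages(heartbeats_per_bucket: Sequence[int]) -> int:
    """Count runs of consecutive buckets that contain no heartbeat."""
    outages = 0
    previous_was_empty = False
    for count in heartbeats_per_bucket:
        is_empty = count == 0
        # A new outage starts when an empty bucket follows a non-empty one
        # (or when the series itself starts with an empty bucket).
        if is_empty and not previous_was_empty:
            outages += 1
        previous_was_empty = is_empty
    return outages


# Hypothetical 15-minute buckets: roughly 7 heartbeats each when healthy.
sample = [7, 7, 0, 0, 0, 0, 7, 8, 0, 7, 7, 0, 0, 0, 7]
print(count_outages(sample))  # 3
```

The bucket size matters here: an outage shorter than one bucket may not produce an empty bucket at all, so the interval should be small relative to the shortest outage of interest.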
I first thought I would sum the ingestion_alive field values per interval, then take the cumulative sum over it (as seen in the picture), then apply the derivative() function to the cumulative sum. The derivative would then have a slope of zero over the periods when there was no heartbeat, and I could count the number of derivative points that are zero. However, applying the derivative() function to the output of the cusum() function only gives back the original time series, as if no functions had been applied. What am I doing wrong here?
.es(index=default.gelf*, timefield='@timestamp', q='CF_APPLICATION_NAME:data-ingestion', metric='sum:ingestion_alive').cusum()
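The behaviour described above is what one would expect if derivative() is a first difference between consecutive points (an assumption about Timelion's implementation, not something confirmed here): differencing a cumulative sum simply recovers the per-bucket sums. The following Python sketch, with hypothetical sample data, illustrates the arithmetic.

```python
from itertools import accumulate

# Hypothetical per-bucket heartbeat sums; zeros are outage buckets.
per_bucket = [7, 7, 0, 0, 8, 0, 7]

# Cumulative sum, analogous to what cusum() is described as producing.
cumulative = list(accumulate(per_bucket))

# First difference of the cumulative sum (assumed meaning of derivative()).
# The first point has no predecessor, so it is skipped here.
difference = [b - a for a, b in zip(cumulative, cumulative[1:])]

print(cumulative)  # [7, 14, 14, 14, 22, 22, 29]
print(difference)  # [7, 0, 0, 8, 0, 7]
print(difference == per_bucket[1:])  # True
```

Under that assumption, the zeros already appear in the summed series itself, so the cumulative sum and derivative steps cancel each other out. What still needs counting is the number of separate runs of zeros rather than the number of zero points.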