I am trying out the anomaly detection job.
The data comes from Logstash at intervals of 5 mins, but sometimes there is no data in a given 5 mins. The data itself is randomly distributed within the 5 min slot. Sometimes there are 5 data points at the start of the time slot, sometimes in the middle, sometimes at the end.
All of these are possible scenarios:
No data:
00000
Data at start:
D0000
Data in the middle:
00D00
Data at the end:
0000D
I have set both Query delay and Frequency to 5m. My idea is to not miss any data.
The suggested Bucket Span was 30m.
Should it not have been 5 mins?
Thanks for the response, @richcollier.
Is it fine if I set Query delay to 1d? I assume the datafeed keeps track of the point up to which it has already ingested data, to avoid duplicates, so the only cost for me would be a more expensive query since the time range is bigger.
I am asking because the data actually comes from a production line, and they run the line only when needed. There is no schedule. There may be days when they do not make anything, and a few days when they run the line 24 hrs non-stop.
There are 3 important parameters: bucket_span, query_delay, and frequency
bucket_span is the analytics aggregation interval.
frequency is how often the data is queried via the datafeed.
query_delay is the total offset (from "now").
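To make the relationship between frequency and query_delay concrete, here is a rough sketch in Python. It is a simplified model based only on the definitions above, not the actual datafeed implementation: the function name `datafeed_windows` and the assumption that each run covers exactly one `frequency`-sized slice ending at "now" minus query_delay are mine, for illustration.

```python
from datetime import datetime, timedelta, timezone


def datafeed_windows(now, frequency, query_delay, runs):
    """Return (start, end) search windows for the most recent `runs` runs.

    Simplified model (an assumption, not the real datafeed logic):
    the latest run searches up to `now - query_delay`, and each run
    covers one `frequency`-sized slice with no gaps or overlaps.
    """
    windows = []
    end = now - query_delay  # query_delay is the offset from "now"
    for _ in range(runs):
        start = end - frequency  # frequency is how often data is queried
        windows.append((start, end))
        end = start  # the previous run ended where this one starts
    return list(reversed(windows))  # oldest window first


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    five_min = timedelta(minutes=5)
    for start, end in datafeed_windows(now, five_min, five_min, runs=3):
        print(start.time(), "->", end.time())
```

Under this model, a larger query_delay only shifts every window further into the past; it does not widen the individual slices. With a 30m bucket_span and a 5m frequency, six such slices would feed each bucket.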