I have a few machine learning jobs all running on the same index pattern. They are single metric doc count jobs. New data is coming in, but the count in the Single Metric Viewer flatlines at 0. I have no idea what I'm doing wrong. The jobs are not flagged as not receiving new data; the viewer just doesn't show it. Any advice would be greatly appreciated.
Hi David,
I assume that if you were to run the ML job over historical data, you don't get this problem (it's only with live data)?
If that's the case, the most likely culprit is that the ingest of your data is delayed: when ML looks for the data (i.e. searches the last X minutes, where X is equal to the bucket_span), the data is not there yet, even though it shows up later.
You can check how many events the ML job is seeing by looking in the .ml-anomalies-* index for your job_id and for result_type:bucket, and inspecting the event_count field. Here, you can see a consistent 870 events per bucket.
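If it helps, here is a minimal sketch of that check as a search body built in Python. The fields job_id, result_type, event_count, timestamp and bucket_span are the ones stored on bucket results; the helper name and the job ID my_doc_count_job are made up for illustration. You would send the body to .ml-anomalies-* with whatever client you normally use.

```python
import json

def bucket_event_count_query(job_id, size=24):
    """Build a search body that returns the newest bucket results for one job."""
    return {
        "size": size,
        # Only pull back the fields needed to see what the job counted.
        "_source": ["timestamp", "event_count", "bucket_span"],
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": "bucket"}},
                ]
            }
        },
    }

if __name__ == "__main__":
    # "my_doc_count_job" is a placeholder job ID.
    print(json.dumps(bucket_event_count_query("my_doc_count_job"), indent=2))
```

Buckets that come back with an event_count of 0 while the raw index does hold documents for the same time range would point at the delay described above.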
So, if you're finding that only your "live data" is getting missed, you probably need to increase the query_delay parameter of the datafeed configuration, or figure out why things are not getting ingested/indexed as fast as they could be.
This seems to be my problem. My data doesn't come in continuously. I get a burst of data starting at about 10 past the hour, and it lasts about 10 minutes. This data reflects the previous hour, so I'll get data for 9:00-10:10 in one big burst from about 10:10 to 10:20. Is there a good way to configure ML for this kind of setup? As this is external data, there is no way for me to change the delivery to a more constant method.
So one option would be to use a 1 hour bucket_span and a query_delay of 30m. That would mean that at 10:30, ML would ask for and analyze data from 9:00-10:00. Then at 11:30, ML would ask for and analyze data from 10:00-11:00, and so on.
If you desire a smaller bucket_span for analysis reasons (let's say 10m), then you'd have to increase query_delay to something like 90m. In this case, at 10:40, ML would ask for and analyze data from 9:00-9:10, then at 10:50, ML would ask for and analyze data from 9:10-9:20, and so on.
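To sanity-check the timing, here is a rough sketch in Python. It assumes a simplified model: the datafeed only looks at data older than now minus query_delay, and buckets are aligned to multiples of bucket_span. The function names, and the "period" and "arrival_lag" parameters describing the hourly burst, are made up for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def latest_complete_bucket(now, bucket_span, query_delay):
    """Newest whole bucket that ends at or before now - query_delay."""
    horizon = now - query_delay
    n = (horizon - EPOCH) // bucket_span  # whole buckets since the epoch
    end = EPOCH + n * bucket_span
    return end - bucket_span, end

def min_query_delay(bucket_span, period, arrival_lag):
    """Smallest delay that still covers the earliest bucket of each burst.

    period: how much event time one burst covers (1 hour here).
    arrival_lag: time from the end of that period until the burst has
    fully arrived (about 20 minutes here).
    """
    return period - bucket_span + arrival_lag

if __name__ == "__main__":
    hour, ten_min = timedelta(hours=1), timedelta(minutes=10)
    start, end = latest_complete_bucket(
        datetime(2018, 3, 1, 10, 30), hour, timedelta(minutes=30))
    print(start.time(), end.time())  # 09:00:00 10:00:00
    print(min_query_delay(hour, hour, timedelta(minutes=20)))     # 0:20:00
    print(min_query_delay(ten_min, hour, timedelta(minutes=20)))  # 1:10:00
```

Under those assumptions the minimum delays come out at 20m for a 1 hour bucket_span and 70m for a 10m one, so 30m and 90m leave some headroom for a burst that runs late.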
In both of the above cases, make the frequency parameter equal to the bucket_span. For data that is more real-time, the frequency parameter is often a fraction of the bucket_span (see https://www.elastic.co/guide/en/elasticsearch/reference/6.2/ml-put-datafeed.html for more info). But in your case, it doesn't really make sense for frequency to be anything other than equal to the bucket_span.
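As a sketch only, the two options could be written down like this in Python. Only the timing-related settings are shown: bucket_span belongs to the job's analysis_config, while query_delay and frequency belong to the datafeed. The helper name is made up, and a real job and datafeed need more fields than this (detectors, job_id, indices and so on).

```python
def batch_feed_settings(bucket_span, query_delay):
    """Timing settings for bursty data: frequency is pinned to bucket_span."""
    return {
        "job": {"analysis_config": {"bucket_span": bucket_span}},
        "datafeed": {"query_delay": query_delay, "frequency": bucket_span},
    }

# Option 1: hourly buckets, analyzed 30 minutes after each bucket closes.
hourly = batch_feed_settings("1h", "30m")

# Option 2: 10 minute buckets, with a longer delay to wait for the burst.
ten_minute = batch_feed_settings("10m", "90m")

if __name__ == "__main__":
    print(hourly)
    print(ten_minute)
```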
Hope that helps
Thanks, this seems to have done the trick!