I have a few machine learning jobs all running on the same index pattern. They are single metric doc count jobs. New data is coming in, but the count in the Single Metric Viewer flatlines at 0. I have no idea what I'm doing wrong. The jobs are not flagged as not receiving new data; they just don't show it. Any advice would be greatly appreciated.
I assume that if you were to run the ML job over historical data, you don't get this problem (it's only with live data)?
If that's the case, the most likely culprit is that ingest of your data is delayed. When ML "looks for it" (i.e. ML queries for the data in the last X minutes, where X is equal to the bucket_span), the data is not there yet, but it shows up at a later time.
You can check to see how many events the ML job is seeing by looking into the .ml-anomalies-* index for your job_id and for result_type:bucket, and inspecting the event_count field. Here, you can see a consistent 870 events per bucket.
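To make that check concrete, here is a minimal Python sketch. It only builds the search body you would send to .ml-anomalies-* and scans the returned hits for buckets with an event_count of 0. The job name my_count_job and the sample hits are made up for illustration; the field names (job_id, result_type, timestamp, event_count) are the ones ML writes to its bucket results.

```python
def bucket_results_query(job_id, size=100):
    """Search body for the most recent bucket results of one ML job."""
    return {
        "size": size,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": "bucket"}},
                ]
            }
        },
    }


def empty_buckets(hits):
    """Return timestamps (epoch ms) of buckets in which ML saw no events."""
    return [
        hit["_source"]["timestamp"]
        for hit in hits
        if hit["_source"].get("event_count", 0) == 0
    ]


# Hand-written sample hits (hypothetical values), shaped like the
# hits.hits list of a search response.
sample_hits = [
    {"_source": {"job_id": "my_count_job", "result_type": "bucket",
                 "timestamp": 1520000000000, "event_count": 870}},
    {"_source": {"job_id": "my_count_job", "result_type": "bucket",
                 "timestamp": 1520000900000, "event_count": 0}},
]

print(bucket_results_query("my_count_job", size=10))
print(empty_buckets(sample_hits))
```

If the buckets with zero events line up with periods where you know documents exist in the source index, that points at the data arriving after ML already queried for it.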
So, if you're finding that only your "live data" is getting missed, you probably need to increase the query_delay parameter of the datafeed configuration, or figure out why things are not getting ingested/indexed as fast as they could be.
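As a rough sketch of what that change looks like, the snippet below builds the request for the update datafeed API. The datafeed ID datafeed-my_count_job and the 120s value are placeholders, not values from this thread. The _xpack/ml path is the 6.x form; newer versions use _ml instead, which is why the helper takes a flag.

```python
import json


def query_delay_update(datafeed_id, query_delay, legacy_xpack_path=True):
    """Build (method, path, body) for updating a datafeed's query_delay."""
    # 6.x exposes ML under /_xpack/ml; later versions use /_ml.
    prefix = "_xpack/ml" if legacy_xpack_path else "_ml"
    path = f"/{prefix}/datafeeds/{datafeed_id}/_update"
    return "POST", path, {"query_delay": query_delay}


# Hypothetical datafeed ID and delay, for illustration only.
method, path, body = query_delay_update("datafeed-my_count_job", "120s")
print(method, path)
print(json.dumps(body))
```

You can paste the printed method, path, and body into Kibana Dev Tools, or send them with whatever HTTP client you already use.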
This seems to be my problem. My data doesn't come in continuously. I get a burst of data starting at about 10 past the hour, and it lasts about 10 minutes. This data reflects the previous hour, so I'll get data for 9:00-10:10 in one big burst from about 10:10-10:20. Is there a good way to configure ML for this kind of setup? As this is external data, there is no way for me to change the delivery to a more constant method.
So one option would be to do a 1 hour bucket_span and a query_delay of 30m. That would mean that at 10:30, ML would ask for, and analyze data from 9:00-10:00. Then at 11:30, ML would ask for, and analyze data from 10:00-11:00, and so on.
If you desire a smaller bucket_span for analysis reasons (let's say 10m), then you'd have to increase query_delay to something like 90m. In this case, at 10:40, ML would ask for, and analyze data from 9:00-9:10, and then at 10:50, ML would ask for, and analyze data from 9:10-9:20, and so on.
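The timing arithmetic can be sketched in a few lines of Python. This is a simplified model, not the datafeed's actual scheduling code: it assumes the newest bucket ML will ask for ends at "now minus query_delay", rounded down to a bucket boundary. The helper names and the date are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def at(hour, minute):
    """Hypothetical helper: a UTC time on an arbitrary example day."""
    return datetime(2018, 3, 1, hour, minute, tzinfo=timezone.utc)


def analysis_window(now, bucket_span, query_delay):
    """Newest bucket (start, end) queried at `now` under the simplified model."""
    span_s = int(bucket_span.total_seconds())
    # Step back by query_delay, then round down to a bucket boundary.
    end_epoch = int((now - query_delay).timestamp()) // span_s * span_s
    end = datetime.fromtimestamp(end_epoch, tz=timezone.utc)
    return end - bucket_span, end


def bucket_is_complete(bucket_end, data_indexed_by, query_delay):
    """True if the bucket's data has been indexed by the time ML queries it."""
    return bucket_end + query_delay >= data_indexed_by


hour = timedelta(hours=1)
minutes = lambda n: timedelta(minutes=n)

# 1h bucket_span, 30m query_delay, checked at 10:30.
print(analysis_window(at(10, 30), hour, minutes(30)))
# 10m bucket_span, 90m query_delay, checked at 10:40.
print(analysis_window(at(10, 40), minutes(10), minutes(90)))
# The 9:00-9:10 bucket is fully indexed by about 10:20 in the burst scenario.
print(bucket_is_complete(at(9, 10), at(10, 20), minutes(90)))
print(bucket_is_complete(at(9, 10), at(10, 20), minutes(60)))
```

The last two lines show why a short delay fails with small buckets: with a 60m delay the 9:00-9:10 bucket would be queried at 10:10, before the burst has finished landing.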
In both of the above cases, make the frequency parameter equal to the bucket_span. For data that is more real-time, the frequency parameter is often a fraction of the bucket_span. (See https://www.elastic.co/guide/en/elasticsearch/reference/6.2/ml-put-datafeed.html for more info). But, in your case, it doesn't really make sense for frequency to be anything other than equal to the bucket_span.
Hope that helps
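For reference, here is a sketch of how the 1 hour option might fit together, written as Python dicts. Note that bucket_span lives in the job's analysis_config, while query_delay and frequency live in the datafeed. The job name, index pattern, and time field are placeholders, and depending on your version the create APIs may need more fields than shown here.

```python
# Hypothetical job settings: a single count detector with a 1h bucket_span.
job_config = {
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [{"function": "count"}],
    },
    "data_description": {"time_field": "@timestamp"},  # assumed field name
}

# Hypothetical datafeed settings for that job.
datafeed_config = {
    "job_id": "my_count_job",      # placeholder job name
    "indices": ["my-index-*"],     # placeholder index pattern
    "query_delay": "30m",
    # Reuse the job's bucket_span so the two cannot drift apart.
    "frequency": job_config["analysis_config"]["bucket_span"],
}


def frequency_matches_bucket_span(job, datafeed):
    """Check that the datafeed's frequency equals the job's bucket_span."""
    return datafeed["frequency"] == job["analysis_config"]["bucket_span"]


print(frequency_matches_bucket_span(job_config, datafeed_config))
```

For the 10m variant, you would change bucket_span to "10m" and query_delay to "90m", and frequency would follow the bucket_span automatically.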
Thanks, this seems to have done the trick!