I'm working on developing an API wrapper for X-Pack ML. So far I've been able to create multiple jobs, but recently the jobs I create aren't showing any results at all. I verified the mapping, and the API even shows that the records were processed, but the anomaly jobs produce no output. Any help would be highly appreciated. Below is more info.
However, I am able to get results for a job that I created through the Kibana UI using the same datafeed/data index, so I'm not really sure what I'm missing.
The logger prints the following, which I'm unable to figure out:
[2017-09-08T10:01:00,464][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] [59b2a277c5aab66deff4c5be] [autodetect/28149] [CAnomalyDetector.cc@223] Records must be in ascending time order. Record 'theday,Amazon,839980800,2025729.375,' time 839980800 is before bucket time 1504879200
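For reference, the two epoch timestamps in that error decode to dates roughly 21 years apart, which shows the record really is far behind the current bucket (a minimal Python sketch; the timestamps are copied from the log line above):

```python
from datetime import datetime, timezone

# Epoch seconds copied from the error message above.
record_time = 839980800    # the offending record's timestamp
bucket_time = 1504879200   # the job's current bucket time

# Convert both to UTC datetimes to see how far apart they are.
record_dt = datetime.fromtimestamp(record_time, tz=timezone.utc)
bucket_dt = datetime.fromtimestamp(bucket_time, tz=timezone.utc)

print(record_dt.isoformat())      # a date in 1996
print(bucket_dt.isoformat())      # a date in September 2017
print(record_time < bucket_time)  # True: the record precedes the bucket
```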
I will be happy to provide more info. Any kind of quick help would be appreciated. Thank you.
If you're seeing that message in the logs, it must mean you're sending data directly to the job via the _data API call. It would be impossible to send out-of-order data to the algorithms when using a datafeed (which pulls the raw data from an existing Elasticsearch index in the cluster), because data that arrives through a datafeed is naturally time-ordered.
If you are indeed sending data from an external source to the API, then yes, you must present that data in strict chronological order, unless you set the latency parameter in the config of the job.
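For illustration, a job configuration that sets latency might look like the sketch below. The job description, detector, field names, and values are all hypothetical, and the exact request shape depends on your X-Pack version:

```python
import json

# Hypothetical anomaly-detection job config; all names/values are examples.
# "latency" in analysis_config lets records arrive slightly out of order,
# within the stated window, when pushed through the _data API.
job_config = {
    "description": "example job fed via the _data API",
    "analysis_config": {
        "bucket_span": "15m",
        "latency": "10m",           # tolerate up to 10 minutes of disorder
        "detectors": [
            {"function": "mean", "field_name": "responsetime"}
        ],
    },
    "data_description": {
        "time_field": "timestamp",
        "time_format": "epoch",     # seconds since the epoch
    },
}

# This JSON body would be sent to the ML job-creation endpoint.
print(json.dumps(job_config, indent=2))
```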
And, if you're sending out of order data to the API, it is a likely explanation of why you're not getting results.
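If you control the sender, the simplest fix is to sort each batch by its time field before posting it (a minimal Python sketch; the record layout is invented for illustration):

```python
# Hypothetical records, keyed on an epoch-seconds "timestamp" field.
records = [
    {"timestamp": 1504879200, "bytes": 120},
    {"timestamp": 1504878300, "bytes": 340},
    {"timestamp": 1504878900, "bytes": 55},
]

# Sort ascending by time so the job never sees a record older than
# the bucket it is currently processing.
records.sort(key=lambda r: r["timestamp"])

times = [r["timestamp"] for r in records]
assert times == sorted(times)  # strict chronological order
```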
Yes, I'm sending the data from an external source (by retrieving from our product) and sending it to the X-pack API. I have to analyze the data from the past as well. What would be the other way to do it?
The only other way is to ingest that data from an external source into an elasticsearch index of your choice, then use the ML job to analyze the data from that index.
Actually, this is the preferred way in general, as it makes the Machine Learning UI more useful, since the charts can show the anomaly in the context of the raw data.
That should work just fine. Just be careful of the timing between the two steps, i.e. between the time that raw data gets ingested/indexed by Elasticsearch and the time that the ML job expects to find and analyze that data.
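One way to manage that timing gap is the datafeed's query_delay setting, which makes ML search a little behind real time so indexing and refresh have a chance to finish. A sketch of such a config follows; the job id, index pattern, and durations are hypothetical:

```python
import json

# Hypothetical datafeed config. "query_delay" pushes each search back in
# time so documents are indexed and refreshed before ML queries them;
# "frequency" controls how often the datafeed searches.
datafeed_config = {
    "job_id": "example-job",
    "indices": ["my-raw-data-*"],
    "query": {"match_all": {}},
    "query_delay": "120s",   # stay 2 minutes behind real time
    "frequency": "150s",     # search every 2.5 minutes
}

print(json.dumps(datafeed_config, indent=2))
```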
Now it is working absolutely fine. But do you think X-Pack should support this scenario natively? If we have a million data points and need to create an ML job, we should be able to do it in parallel, in near real time, instead of waiting for the data to be indexed into Elasticsearch before creating the job. Moreover, when building APIs on top of X-Pack, I don't think it's a good idea to keep the user waiting until the data is indexed and the results are loaded!
This may not make much sense, but I would like to know your thoughts.
I also think this information should be added to the documentation. I wasn't able to figure it out and spent a couple of days checking for issues in my own code/logic. I might have missed it, but I researched thoroughly, and this could be the case for many people!
Again, you don't have to index the data first into Elasticsearch in order to use ML - but there are benefits if you want to use our UI.
If you plan on writing your own UI, then there is little need to index the data first. We have customers that "OEM" the ML technology who do exactly this (write their own UI and choose to keep raw data elsewhere).
However, like I said: if you are indeed sending data from an external source to the API, then you must present that data in strict chronological order, unless you set the latency parameter in the job's config. There's no other way around this.
Yeah, the plan is to use our own UI and support ML for large volumes of data. Based on what you said, we need to index the data and then run the ML job.
Is there any workaround to avoid the latency of indexing a large volume of data when running the ML job for the first time?
If not, do you think we need to open a ticket? I can see the need to support this scenario.
Sorry for the multiple follow-ups. This clarification is much needed for my current work. Thank you for your patience.
Yes: send the data in chronological order (if using the API rather than indexing into Elasticsearch). If you want to index into Elasticsearch, remember that you don't necessarily need to index it all at once.
Again, there are two ways you can architect your solution:
1. Index the raw data into an Elasticsearch index first, either a bunch of data at once or every X minutes/seconds/etc. Then ML will query out the data every X minutes/seconds/etc.
2. Send raw data in (mostly) chronological order to the _data API endpoint (as long as it is in time order within a tolerance less than the configured latency). ML will process the data as it receives it.
Since you're planning on using your own UI, I'd think you'd choose #2. Again, you don't have to, but the only reason to do #1 at this point would be to centralize the raw data if it is already not centralized.
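To check whether a stream satisfies option #2's tolerance, you can measure the largest backward jump in the timestamps and compare it to the job's latency (a minimal sketch; the records and threshold are illustrative):

```python
def max_backward_skew(timestamps):
    """Largest number of seconds any record steps back behind the
    maximum timestamp seen so far (0 for fully ordered input)."""
    worst, seen_max = 0, float("-inf")
    for t in timestamps:
        if t < seen_max:
            worst = max(worst, seen_max - t)
        seen_max = max(seen_max, t)
    return worst

# Example stream: mostly ordered, with one record 90 seconds late.
stream = [100, 160, 220, 130, 280]
latency_seconds = 600  # hypothetical job latency of 10 minutes

skew = max_backward_skew(stream)
print(skew)                     # 90
print(skew <= latency_seconds)  # True: within tolerance, ML accepts it
```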
But, in either case - you don't have to wait for all of the data that you'll ever want to analyze to be gathered up before sending to ML. ML can take the data in batches. Big batches (like weeks or months worth of data) or small batches (seconds or minutes worth of data). If the batches of data are bigger than the bucket_span of the job, the ML algorithms will process the data as fast as possible. If data is fed in batches less than bucket_span, ML will still process the data, but at the rate that's defined by bucket_span.
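The batching idea above can be sketched as grouping sorted records into chunks of roughly one bucket_span each and sending the chunks in order (the record shape and span are hypothetical):

```python
from itertools import groupby

BUCKET_SPAN = 900  # 15 minutes, matching a hypothetical job's bucket_span

# Records already sorted by epoch-seconds timestamp.
records = [
    {"timestamp": 1504879200, "value": 1.0},
    {"timestamp": 1504879500, "value": 2.0},
    {"timestamp": 1504880200, "value": 3.0},
    {"timestamp": 1504881000, "value": 4.0},
]

# Group into batches that each cover one bucket_span window; each batch
# could then be posted to the _data endpoint in turn.
batches = [
    list(group)
    for _, group in groupby(records, key=lambda r: r["timestamp"] // BUCKET_SPAN)
]

print(len(batches))  # 3 bucket-sized batches for these timestamps
```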