Elasticsearch X-Pack ML API: Unable to Retrieve Results


(Rohithnama) #1

I'm working on developing an API wrapper for X-Pack ML. So far I've been able to create multiple jobs, but recently the jobs I create aren't showing any results at all. I verified the mapping, and the records were processed by the API, but the anomaly jobs produce no output. Any help would be highly appreciated. Below is more info.

However, I am able to get results for a job that I created through the Kibana UI using the same datafeed/data index. So I'm not really sure what I'm missing.

The logger prints the following info, which I'm unable to figure out.

[2017-09-08T10:01:00,464][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] [59b2a277c5aab66deff4c5be] [autodetect/28149] [CAnomalyDetector.cc@223] Records must be in ascending time order. Record 'theday,Amazon,839980800,2025729.375,' time 839980800 is before bucket time 1504879200

I will be happy to provide more info. Any kind of quick help would be appreciated. Thank you.

(rich collier) #2

If you're getting the above message in the logs, it must mean that you're sending data directly to the job via the _data API call. Otherwise, it would be impossible to send out-of-order data to the algorithms, because the "datafeed" (which pulls the raw data from an existing Elasticsearch index in the cluster) returns data that is naturally time-ordered.

If you are indeed sending data from an external source to the API, then yes, you must present that data in strict chronological order, unless you set the latency parameter in the config of the job.

And if you're sending out-of-order data to the API, that likely explains why you're not getting results.
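A minimal sketch of a job configuration that tolerates slightly out-of-order data via that latency setting — field names follow the X-Pack ML anomaly_detectors API, while the detector function, field names, and values shown are hypothetical:

```python
import json

def build_job_config(bucket_span="1h", latency="10m"):
    """Build a job config whose `latency` lets records arrive
    up to that far out of time order before being dropped."""
    return {
        "description": "Job fed via the _data endpoint",
        "analysis_config": {
            "bucket_span": bucket_span,
            "latency": latency,
            "detectors": [
                {"function": "mean", "field_name": "responsetime"}
            ],
        },
        "data_description": {
            "time_field": "timestamp",
            "time_format": "epoch",
        },
    }

# The config would then be sent with, e.g.:
#   PUT /_xpack/ml/anomaly_detectors/my_job
print(json.dumps(build_job_config(), indent=2))
```

Without `latency` set, any record whose time falls before the current bucket is discarded, which matches the error in the logs above.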

(rich collier) #4

Looking more closely, 839980800 is a date in 1996, while 1504879200 is a timestamp from today, September 8, 2017.

You cannot send data from the past into the algorithms once that "bucket" (of size=bucket_span) has already been processed.
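Decoding the two epoch timestamps from the error message confirms the gap:

```python
from datetime import datetime, timezone

# The two epoch-second values from the log line, rendered in UTC.
record_time = datetime.fromtimestamp(839980800, tz=timezone.utc)
bucket_time = datetime.fromtimestamp(1504879200, tz=timezone.utc)

print(record_time)  # 1996-08-14 00:00:00+00:00
print(bucket_time)  # 2017-09-08 14:00:00+00:00

# The record predates the current bucket by ~21 years, so it is rejected.
assert record_time < bucket_time
```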

(Rohithnama) #5

Hey Rich,

Thanks for your prompt response.

Yes, I'm sending the data from an external source (retrieved from our product) to the X-Pack API. I also need to analyze data from the past. What would be another way to do it?

Please let me know.

(rich collier) #6

The only other way is to ingest that data from an external source into an elasticsearch index of your choice, then use the ML job to analyze the data from that index.

Actually, this is the preferred way in general, as it makes the Machine Learning UI more useful, since the charts can show the anomaly in the context of the raw data.

(Rohithnama) #7


As per the work I'm doing, I need to do it programmatically, through Python!

In this case,

  1. First, I need to index the data in Elasticsearch -> this gives the index_id.

  2. Create a job and a datafeed using the index_id obtained in the first step.

  3. Retrieve the results.

  4. Use a scheduler to automate this process.
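The steps above might be sketched as an ordered list of REST calls — the endpoint paths follow the X-Pack (5.x) ML API, while the job, datafeed, and index names, the detector, and the field names are all hypothetical:

```python
def workflow_steps(job="my_job", feed="datafeed-my_job", index="my-metrics"):
    """Return the (method, path, body) calls for the four-step workflow,
    to be executed in order with any HTTP client. Step 1 (bulk-indexing
    the raw data into `index`) is assumed to have happened already."""
    job_config = {
        "analysis_config": {
            "bucket_span": "1h",
            "detectors": [{"function": "mean", "field_name": "responsetime"}],
        },
        "data_description": {"time_field": "timestamp"},
    }
    return [
        # 2. create the job, then a datafeed reading from the index
        ("PUT", f"/_xpack/ml/anomaly_detectors/{job}", job_config),
        ("PUT", f"/_xpack/ml/datafeeds/{feed}", {"job_id": job, "indexes": [index]}),
        # open the job and start the datafeed
        ("POST", f"/_xpack/ml/anomaly_detectors/{job}/_open", None),
        ("POST", f"/_xpack/ml/datafeeds/{feed}/_start", None),
        # 3. retrieve the bucket results
        ("GET", f"/_xpack/ml/anomaly_detectors/{job}/results/buckets", None),
    ]

for method, path, _ in workflow_steps():
    print(method, path)
```

Step 4 would wrap the start/retrieve calls in whatever scheduler is used.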

Please correct me if I'm wrong!

Thanks for your time!

(rich collier) #8

Yes, that workflow should be just fine. Are you only loading in historical data or are you also planning on ingesting data in "real-time"?

(Rohithnama) #9

Yes, I'm planning to ingest real time data as well.

The goal is to ingest the historical data -> run the anomaly job -> then ingest the real-time data and run the job continuously using a scheduler.

Do I need to make any modifications in this case, or are there any other things I need to keep in mind?

(rich collier) #10

That should work just fine - just be careful of the timing between steps 1 and 2 (between the time that raw data gets ingested/indexed by Elasticsearch and the time the ML job expects to find and analyze that data).
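The datafeed's `query_delay` setting exists for exactly this gap: it tells ML to query slightly behind "now" so documents still being indexed aren't missed. A sketch of such a datafeed config — field names per the X-Pack ML datafeed API, job/index names hypothetical:

```python
import json

def build_datafeed_config(job_id="my_job", index="my-metrics",
                          frequency="150s", query_delay="60s"):
    """Datafeed config: `frequency` is how often ML queries for new data;
    `query_delay` is how far behind real time each query looks, giving
    Elasticsearch time to make newly indexed documents searchable."""
    return {
        "job_id": job_id,
        "indexes": [index],
        "frequency": frequency,
        "query_delay": query_delay,
        "query": {"match_all": {}},
    }

print(json.dumps(build_datafeed_config(), indent=2))
```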

(Rohithnama) #11

Yeah, I will make sure to index the data before starting the job. Thank you so much for the quick help.

I should be done in a day or two. I will update here if I'm good with the process.

Thanks again Rich!

(Rohithnama) #12

Hey Rich,

Now it is working absolutely fine. But do you think X-Pack should support this as a feature? If we have a million data points and need to create an ML job, we should be able to do it in parallel, similar to real time, instead of waiting for the data to be indexed into Elasticsearch before creating the job. Moreover, when building APIs on top of X-Pack, I don't think it's a good idea to keep the user waiting until the data is indexed and the results are loaded!

This may not make much sense but would like to know your thoughts though.

I also think this information needs to be added to the documentation. I wasn't able to figure it out and spent a couple of days checking for issues in my own code/logic. I might have missed it, but I did a good amount of research, and this could be the case for many others!

Looking forward to hearing your opinion.

(rich collier) #13

That's good to hear.

Again, you don't have to index the data first into Elasticsearch in order to use ML - but there are benefits if you want to use our UI.

If you plan on writing your own UI, then there is little need to index the data first. We have customers that "OEM" the ML technology who do exactly this (write their own UI and choose to keep raw data elsewhere).

However, like I said, if you are indeed sending data from an external source to the API, then you must present that data in strict chronological order unless you set the latency parameter in the job's config. There's no other way around this.

(Rohithnama) #14

Yeah, the plan is to use our own UI and support ML for large volumes of data. Based on what you said, we need to index the data and then run the ML job.

Is there any workaround to avoid the latency of indexing a large volume of data when running the ML job for the first time?

If not, do you think we need to open a ticket? I can see the need to support this scenario.

Sorry for the multiple follow-ups. This clarification is much needed for my current work. Thank you for your patience.

(rich collier) #15

Yes - send the data in chronological order (if using the API and not indexing into Elasticsearch). If you want to index into Elasticsearch, remember that you don't necessarily need to index it all at once.

Again, there are two ways you can architect your solution:

  1. Index the raw data first into an Elasticsearch index - either a bunch of data at once, or every X minutes/seconds/etc. ML will then query out the data every X minutes/seconds/etc.
  2. Send raw data in (mostly) chronological order to the _data API endpoint (as long as it is in time order within a tolerance less than the configured latency). ML will process the data as it gets it.

Since you're planning on using your own UI, I'd think you'd choose #2. Again, you don't have to, but the only reason to do #1 at this point would be to centralize the raw data if it is already not centralized.

But, in either case - you don't have to wait for all of the data that you'll ever want to analyze to be gathered up before sending to ML. ML can take the data in batches. Big batches (like weeks or months worth of data) or small batches (seconds or minutes worth of data). If the batches of data are bigger than the bucket_span of the job, the ML algorithms will process the data as fast as possible. If data is fed in batches less than bucket_span, ML will still process the data, but at the rate that's defined by bucket_span.

Hope that's clear!

(Rohithnama) #16

Hi Rich,

I got a chance to discuss this with a team member and was able to sort out the issue based on the points you mentioned.

I appreciate the quick response and the great insights.

Thank you!

(system) #17

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.