I'm working on creating an ML job through the Kibana UI as well as through an ML wrapper API that we developed internally.
Using the wrapper, I first index all the test data and then start the job and the datafeed. However, this does not process all the records. If I use the same data index and create a job through the UI with the same configuration parameters, the whole index is processed.
This image shows the difference in the number of records processed for the same data set. In particular, the last two ML jobs are configured with the same parameters (including the query delay and frequency), but there is still a difference in the number of records processed as well as in the latest date stamp.
The job named test was created through the UI and processed all 322 indexed records, but the other jobs, which were created through the wrapper API, only processed part of the data.
PS: There is no invalid data either.
Please let me know where the issue is and what a possible solution could be. I will be happy to provide more information as needed.
Sequence of API calls: creating the index with the mapping -> indexing the data into Elasticsearch using the bulk API -> creating the job -> opening the job -> creating the datafeed -> starting the datafeed (running in real time) -> retrieving the results periodically.
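For reference, here is a minimal sketch of that sequence as plain REST calls (Python with requests; the _xpack/ml paths are the 5.x/6.x ML endpoints, and the index, job, and field names are all illustrative, not our actual setup):

```python
import json
import requests

ES = "http://localhost:9200"  # assumed local cluster
HEADERS = {"Content-Type": "application/json"}

# 1. Create the index with a mapping (illustrative single date field).
requests.put(f"{ES}/test-data", headers=HEADERS, json={
    "mappings": {"doc": {"properties": {"@timestamp": {"type": "date"}}}}
})

# 2. Bulk-index the test data (two-line action/source format).
bulk_body = (
    '{"index": {"_index": "test-data", "_type": "doc"}}\n'
    '{"@timestamp": "2016-11-01T00:00:00Z", "value": 1}\n'
)
requests.post(f"{ES}/_bulk", headers=HEADERS, data=bulk_body)

# 3. Create the ML job.
requests.put(f"{ES}/_xpack/ml/anomaly_detectors/test-job", headers=HEADERS, json={
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [{"function": "count"}],
    },
    "data_description": {"time_field": "@timestamp"},
})

# 4. Open the job.
requests.post(f"{ES}/_xpack/ml/anomaly_detectors/test-job/_open")

# 5. Create the datafeed.
requests.put(f"{ES}/_xpack/ml/datafeeds/datafeed-test-job", headers=HEADERS, json={
    "job_id": "test-job",
    "indices": ["test-data"],
    "query": {"match_all": {}},
})

# 6. Start the datafeed; with no "end" it keeps running in real time.
requests.post(f"{ES}/_xpack/ml/datafeeds/datafeed-test-job/_start")

# 7. Retrieve the results periodically.
r = requests.get(f"{ES}/_xpack/ml/anomaly_detectors/test-job/results/buckets")
print(json.dumps(r.json(), indent=2))
```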
All the calls are automated using a scheduler, typically with a span of one day, but it will go down to the minute level soon.
I also tried several values for the query delay and frequency, up to 300s and 3600s respectively, but the problem remains the same. I'm not sure why the records up to the current date stamp are not processed when the job is created through the API.
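For concreteness, these are the two settings I was changing; they live in the datafeed configuration (a sketch with illustrative names, showing the largest values I tried):

```python
import requests

ES = "http://localhost:9200"

# query_delay: how far behind "now" each real-time search looks
# (to allow for indexing lag); frequency: how often the datafeed searches.
requests.put(f"{ES}/_xpack/ml/datafeeds/datafeed-test-job", json={
    "job_id": "test-job",
    "indices": ["test-data"],
    "query": {"match_all": {}},
    "query_delay": "300s",
    "frequency": "3600s",
})
```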
I tried that, but it didn't work either. Normally I wasn't passing the start time, so the datafeed processes the data from the earliest time stamp; however, for some reason it wasn't processing up to the current date.
When I start the job from 11/01/2016, it processes the data only up to a certain date: for example, sometimes up to 5/18/2017 and sometimes up to 09/04/2017, but never up to the current date, in contrast with the job created from the UI.
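For reference, this is how a start time is passed when starting the datafeed (a sketch with the same illustrative names; the date is the 11/01/2016 start mentioned above):

```python
import requests

ES = "http://localhost:9200"

# Start the datafeed from an explicit timestamp; with no "end",
# it should catch up to now and then keep running in real time.
requests.post(
    f"{ES}/_xpack/ml/datafeeds/datafeed-test-job/_start",
    json={"start": "2016-11-01T00:00:00Z"},
)
```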
Changing the query delay and frequency didn't help much.
The query_delay parameter is only relevant for real-time operation, not for historical data.
So, I've tried to replicate your situation and cannot. I created two identical jobs, one with the UI and one with the API, and I get the same number of records processed: the full 86,274 events in the index:
I find it unusual as well, because with the same configuration and the same logic, the number of records processed was different each time.
One workaround I found was to restart the datafeed after the initial set of records was processed; it then processes the rest of the records. For example, if 200 records were processed the first time and I restart the datafeed after that, it processes the remaining records and the job then runs in real time. I suspect it is because of a lag between Elasticsearch indexing the data and making all the data points searchable before the ML job starts processing, but that shouldn't be the case since the records are very few in number. You might have a better idea here!
I'm also not sure if I have to restart the datafeed every time after indexing records so that they get processed; I'm yet to test this, though.
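Concretely, the restart workaround looks like this (a sketch against the same illustrative datafeed):

```python
import requests

ES = "http://localhost:9200"

# Stop the datafeed after the initial, partial run...
requests.post(f"{ES}/_xpack/ml/datafeeds/datafeed-test-job/_stop")

# ...then start it again; it resumes from where it left off and
# picks up the records that were missed the first time.
requests.post(f"{ES}/_xpack/ml/datafeeds/datafeed-test-job/_start")
```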
What could the possible issue be? I will be happy to provide more information.
^^ This is the only thing that is different in our setups, then, since you ingest the data as part of the workflow and I don't. I would suggest adding a wait of a few seconds after the _bulk ingest command before starting the ML job's datafeed. If you're running things sequentially in a script, the time between those two events could literally be just milliseconds as you currently have it.
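Something like this (a sketch; the sleep length is arbitrary, and the explicit _refresh is an extra safeguard that makes the newly indexed documents searchable immediately rather than waiting for the index's refresh interval):

```python
import time
import requests

ES = "http://localhost:9200"
HEADERS = {"Content-Type": "application/json"}

# Bulk-index the data (illustrative single document).
bulk_body = (
    '{"index": {"_index": "test-data", "_type": "doc"}}\n'
    '{"@timestamp": "2017-09-04T00:00:00Z", "value": 1}\n'
)
requests.post(f"{ES}/_bulk", headers=HEADERS, data=bulk_body)

# Force a refresh so the new documents are visible to searches...
requests.post(f"{ES}/test-data/_refresh")

# ...and/or wait a few seconds before starting the datafeed.
time.sleep(5)

requests.post(f"{ES}/_xpack/ml/datafeeds/datafeed-test-job/_start")
```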
Ah! Adding a wait of a few seconds did the trick. It is processing all the records now. I hope I will be able to run the job in real time, without needing to restart the datafeed, by setting an optimal query delay and frequency.