Assume the following scenario:
I have the resources to collect only 60 days of data.
My job has been running for more than 200 days.
Today I found out that we captured an anomaly, and we want to remove it through the following steps:
Create a calendar event covering the timeline of the anomaly's occurrence, so that this data is skipped and not learnt by the new job.
Clone the existing job and run the cloned job.
The problem with the above approach is that the cloned job is trained only on the last 60 days of data, whereas my old job was based on a model that learnt from 200 days of data. So is there a way to take a snapshot of the model at any point in time (in this case, I want the snapshot of the model from just before the anomaly)? If yes, can I import this model into the cloned job and start feeding it more data?
You can use the ML Model Snapshots API to revert the job to a model snapshot that was saved before your anomaly occurred, and you could pass the delete_intervening_results flag to delete the results produced after that snapshot, including the anomaly.
After this, you could start the datafeed again, but choose the start time to be after the anomaly.
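To make the sequence above concrete, here is a rough sketch in Python that only builds the REST requests and does not send anything. It assumes the `_ml/...` path prefix (older releases used `_xpack/ml/...`), that the job has to be closed before reverting, and that each snapshot returned by the get model snapshots call carries `snapshot_id` and `latest_record_time_stamp` fields. The helper names and the job, datafeed, and timestamp values are made up for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

# (HTTP method, path, optional JSON body)
Request = Tuple[str, str, Optional[Dict[str, Any]]]


def pick_snapshot_before(snapshots: List[Dict[str, Any]],
                         anomaly_start_ms: int) -> Optional[str]:
    """Return the ID of the newest snapshot whose data ends before the anomaly.

    Hypothetical helper: `snapshots` is assumed to be the list returned by
    GET _ml/anomaly_detectors/<job_id>/model_snapshots.
    """
    candidates = [s for s in snapshots
                  if s["latest_record_time_stamp"] < anomaly_start_ms]
    if not candidates:
        return None
    newest = max(candidates, key=lambda s: s["latest_record_time_stamp"])
    return newest["snapshot_id"]


def build_revert_requests(job_id: str, datafeed_id: str,
                          snapshot_id: str, restart_ms: int) -> List[Request]:
    """Build the requests for: stop, close, revert, reopen, restart."""
    job = f"_ml/anomaly_detectors/{job_id}"
    feed = f"_ml/datafeeds/{datafeed_id}"
    return [
        ("POST", f"{feed}/_stop", None),
        ("POST", f"{job}/_close", None),
        # Revert the model and drop results produced after the snapshot.
        ("POST", f"{job}/model_snapshots/{snapshot_id}/_revert",
         {"delete_intervening_results": True}),
        ("POST", f"{job}/_open", None),
        # Restart the datafeed from a point after the anomaly (epoch ms).
        ("POST", f"{feed}/_start", {"start": str(restart_ms)}),
    ]
```

You would pick `restart_ms` to be just after the end of the anomaly and then send each request in order with whatever HTTP client you already use.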
Yes - it is a settable parameter in the create job API call:
model_snapshot_retention_days
(long) The time in days that model snapshots are retained for the job. Older snapshots are deleted. The default value is 1, which means snapshots are retained for one day (twenty-four hours).
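As an illustration, a create job request body with this parameter set might be assembled like the sketch below. The job name `my-job`, the single `count` detector, the `15m` bucket span, the `timestamp` time field, and the 250-day retention are all placeholder assumptions, not values from this thread; only `model_snapshot_retention_days` is the setting being discussed.

```python
import json
from typing import Any, Dict


def build_job_config(retention_days: int) -> Dict[str, Any]:
    """Build a minimal create-job body with a custom snapshot retention.

    Everything except model_snapshot_retention_days is a placeholder.
    """
    return {
        "analysis_config": {
            "bucket_span": "15m",
            "detectors": [{"function": "count"}],
        },
        "data_description": {"time_field": "timestamp"},
        "model_snapshot_retention_days": retention_days,
    }


if __name__ == "__main__":
    # The body would be sent with: PUT _ml/anomaly_detectors/my-job
    print(json.dumps(build_job_config(250), indent=2))
```

Choosing a retention longer than the period you may need to roll back over means a snapshot from before the anomaly should still be available when you need it.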
If I use the create job API for an existing job, won't it hit a resource_already_exists_exception? I need to know whether it is possible to modify an already existing job.