Is it possible to redirect a live machine learning (anomaly detection) job to a new data stream that has the same set of fields as the old historical index?
Background:
We have 10 ML jobs (anomaly detection) currently running in production. We used the last year of data to build the models and then made the jobs live for real-time anomaly detection (bucket span 4h). The issue now is that the index is becoming too big (50 GB+), so we plan to close the index, create a data stream instead, and enable ILM on it.
Now, can we redirect the datafeed to the new data stream without breaking the live job?
Will it affect the model?
Please let me know so that we can make the necessary changes to handle this large index.
When you restart the datafeed you can specify start=0 and it will pick up from the time it was stopped.
That depends on how different the data in the new data stream is. If the new data stream contains identical data to the old index but just in a data stream then there should be no impact whatsoever on the model. If the new data stream turns out to contain different data (either deliberately or due to a mistake) then the model could change significantly as it learns from the different data.
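For reference, here is a minimal sketch of the sequence discussed above: stop the datafeed, point it at the new data stream, and start it again with start=0. It only builds the REST calls as (method, path, body) tuples and sends nothing. The datafeed ID `datafeed-my-job` and the data stream name `my-data-stream` are hypothetical placeholders, and the helper function is mine, not part of any Elastic client.

```python
import json


def datafeed_switch_requests(datafeed_id, new_indices, start="0"):
    """Build the three datafeed REST calls as (method, path, body) tuples.

    Nothing is sent to a cluster; run the calls with the client or
    console of your choice.
    """
    base = f"_ml/datafeeds/{datafeed_id}"
    return [
        # 1. Stop the datafeed (the job and its model state are kept).
        ("POST", f"{base}/_stop", None),
        # 2. Point the datafeed at the new data stream.
        ("POST", f"{base}/_update", {"indices": list(new_indices)}),
        # 3. Restart; with start=0 the datafeed resumes from where it stopped.
        ("POST", f"{base}/_start", {"start": start}),
    ]


if __name__ == "__main__":
    # Hypothetical names, replace with your own datafeed and data stream.
    for method, path, body in datafeed_switch_requests(
        "datafeed-my-job", ["my-data-stream"]
    ):
        print(method, path, json.dumps(body) if body is not None else "")
```

It is probably worth trying this on a cloned job in a test environment first, and checking that the new data stream really covers the period from the point where the datafeed was stopped.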
Hi @droberts195 ,
When I was discussing this implementation with my team, one question came to mind.
So when we change the index in the datafeed as you suggested and make the job live again, and the job then takes a new snapshot, will it have the model parameters from both data sources (old and new index), or will it contain parameters from the new source only?
Our concern is: if we change the data source, will the model drop the historical parameters when a new snapshot is taken?
Let me give you an example. Let's assume the model before the change is y=ax+b, where a and b are the parameters. If I change the datafeed to a new index and the ML job then takes a new model snapshot, i.e. y=a1x+b1, will a1 and b1 carry the information from the historical a and b, or will they simply come from the new index data?
I hope I have explained the problem clearly. If not, please let me know.
This is important because we will be making the changes in the production Elastic Stack, so we need to be sure how this works.
It will snapshot a model that is a blend of data behaviors from the old index and the new index. As @droberts195 said, if the data is mostly consistent, it will be like no change happened. If the new data does have a significant difference in the behavior, then the blended model will, over time, drift away from being like the "old data" and more like the "new data".
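To make the "blend" idea concrete, here is a toy illustration only. It is not the algorithm the anomaly detection job actually uses; it is just an exponentially weighted running mean, with a `decay` value I picked arbitrarily. It shows that an estimate which keeps updating stays put when the new data looks like the old data, and drifts gradually toward the new level when it does not.

```python
def blended_mean(values, decay=0.9, initial=None):
    """Exponentially weighted running mean (toy stand-in for a model parameter).

    Each new value moves the estimate a little, so the result is a blend of
    everything seen so far, weighted toward the most recent data.
    """
    estimate = initial
    for v in values:
        if estimate is None:
            estimate = v
        else:
            estimate = decay * estimate + (1 - decay) * v
    return estimate


if __name__ == "__main__":
    old_model = blended_mean([10.0] * 50)  # learned from the old index

    # New source behaves the same: the estimate stays where it was.
    print(blended_mean([10.0] * 5, initial=old_model))

    # New source behaves differently: the estimate sits between old and new...
    print(blended_mean([20.0] * 5, initial=old_model))

    # ...and over time ends up looking like the new data.
    print(blended_mean([20.0] * 200, initial=old_model))
```

In the terms of the y=ax+b example, a1 and b1 would start from a and b and only move as far as the new data pulls them.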