Is there any recommendation for the amount of training data to have available for ML?
We currently store one week's worth of data in Elasticsearch, which comes to about 100 GB of storage. We have a process that cleans up any indices older than a week, but we want to begin taking advantage of the ML capabilities and know we will need to pump up the disk space we have to retain more data.
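For context, our cleanup is roughly equivalent to an ILM policy with a 7-day delete phase. A minimal sketch of that idea (the cluster address and policy name are placeholders, and the policy would still need to be attached to the indices, e.g. via an index template setting `index.lifecycle.name`):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Hypothetical ILM policy: delete indices once they are 7 days old.
# The policy name is made up for illustration.
policy = {
    "policy": {
        "phases": {
            "delete": {
                "min_age": "7d",
                "actions": {"delete": {}},
            }
        }
    }
}
r = requests.put(f"{ES}/_ilm/policy/weekly-cleanup", json=policy)
r.raise_for_status()
```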
What we're unsure of is exactly how much data we need to retain. We currently store data associated with retail transactions and web traffic and have day-of-the-week, day-of-the-month, and monthly (seasonal) trends. What would be the recommended retention for data of this nature to take advantage of ML?
Quoting the above blog post: "The sweet spot is usually about 3 weeks or 3 full intervals of periodic data."
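To put rough numbers on that for the data described in the question (back-of-the-envelope only, assuming ingest stays around 100 GB per week and the longest seasonal period is a month):

```python
# Back-of-the-envelope retention estimate using the figures from the question.
# Assumes ingest stays flat at ~100 GB per week (primary data only).
weekly_volume_gb = 100          # ~100 GB per week, per the question
intervals = 3                   # "3 full intervals of periodic data"
weeks_per_interval = 52 / 12    # longest period mentioned is monthly (~4.3 weeks)

retention_weeks = intervals * weeks_per_interval
retention_gb = retention_weeks * weekly_volume_gb
print(f"~{retention_weeks:.0f} weeks of retention, roughly {retention_gb:.0f} GB")
```

That works out to roughly 13 weeks, or on the order of 1.3 TB of retained primary data at the current ingest rate.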
How much data do we need to retain?
Machine Learning models, whether for anomaly detection or forecasting, are self-contained. That means once modeling has seen the data, it does not need to re-access it; the important parts are incorporated into the model. Be aware, though, that the model keeps changing as you feed data in. So in theory you could delete the data immediately after feeding it, but we do not advise doing that, because you lose the ability to debug data problems and to visualize the data. Also note that models are snapshotted, which means that after a crash we need to re-feed the data between the snapshot time and the time the crash occurred.
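To make that concrete, here is a rough sketch of what an anomaly detection job and datafeed over data like yours could look like; the job id, index pattern, field names, and bucket span are made up for illustration. `model_snapshot_retention_days` controls how long those periodic model snapshots are kept:

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address; real clusters will also need auth

# Hypothetical anomaly detection job over retail transaction data.
job = {
    "description": "Transaction event-rate anomalies",
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [{"function": "count", "detector_description": "event rate"}],
    },
    "data_description": {"time_field": "@timestamp"},
    # How long the periodic model snapshots mentioned above are kept.
    "model_snapshot_retention_days": 10,
}
r = requests.put(f"{ES}/_ml/anomaly_detectors/retail-transaction-rate", json=job)
r.raise_for_status()

# Datafeed that streams documents from the source indices into the job.
datafeed = {
    "job_id": "retail-transaction-rate",
    "indices": ["retail-transactions-*"],
    "query": {"match_all": {}},
}
r = requests.put(f"{ES}/_ml/datafeeds/datafeed-retail-transaction-rate", json=datafeed)
r.raise_for_status()
```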