Size of Training Data

Is there any recommendation for the amount of training data to have available for ML?

We currently store one week's worth of data in Elasticsearch, which comes to about 100GB of storage. We have a process that cleans up any indexes older than a week, but we want to begin taking advantage of the ML capabilities and know we will need to increase our disk space to retain more data.
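For reference, our cleanup is essentially a small script along these lines (a rough sketch; the host, index pattern, and 7-day cutoff are illustrative):

```python
# Rough sketch of our weekly cleanup: delete indices whose creation date is
# older than 7 days. The host and the "retail-*" index pattern are placeholders.
from datetime import datetime, timedelta, timezone

import requests

ES_URL = "http://localhost:9200"  # illustrative endpoint
CUTOFF = datetime.now(timezone.utc) - timedelta(days=7)

# _cat/indices can return each index's creation date as JSON.
resp = requests.get(
    f"{ES_URL}/_cat/indices/retail-*",
    params={"format": "json", "h": "index,creation.date"},
)
resp.raise_for_status()

for idx in resp.json():
    created = datetime.fromtimestamp(int(idx["creation.date"]) / 1000, tz=timezone.utc)
    if created < CUTOFF:
        # Delete indices past the retention window.
        requests.delete(f"{ES_URL}/{idx['index']}").raise_for_status()
```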

What we're unsure of is exactly how much data we need to retain. We currently store data associated with retail transactions and web traffic and have day-of-the-week, day-of-the-month, and monthly (seasonal) trends. What would be the recommended retention for data of this nature to take advantage of ML?

Thanks in advance!


I recommend the following link: On-demand forecasting with machine learning in Elasticsearch | Elastic Blog

How much data is needed for training?

Quoting the above blog post: "The sweet spot is usually about 3 weeks or 3 full intervals of periodic data." Since you mention monthly seasonality, three full intervals of that cycle would mean roughly three months of history.

How much data do we need to retain?

Machine learning models, whether for anomaly detection or forecasting, are self-contained: once modeling has seen the data it does not need to re-access it, because the important parts are incorporated into the model. Be aware, though, that the model keeps changing as you feed data in. So in theory you can delete the data immediately after feeding it, but we do not advise doing that, as you lose the ability to debug data problems and the ability to visualize the data. Also note that models are snapshotted, which means that after a crash we need to re-feed the data between the snapshot time and the time the crash occurred.
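To illustrate the "self-contained" point: once an anomaly detection job has modeled your data, requesting a forecast is a single call against the model itself rather than against the raw indices. A minimal sketch, assuming a job named retail-transactions (the job id and durations are made up):

```python
# Minimal sketch: ask an existing anomaly detection job for a forecast.
# The job id "retail-transactions", the host, and the durations are illustrative.
import requests

ES_URL = "http://localhost:9200"  # illustrative endpoint

resp = requests.post(
    f"{ES_URL}/_ml/anomaly_detectors/retail-transactions/_forecast",
    json={"duration": "7d", "expires_in": "14d"},  # forecast one week ahead
)
resp.raise_for_status()
print(resp.json())  # includes the forecast_id used to look up the forecast results
```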

