I have a dataset on which I would like to perform Anomaly Detection.
The data is currently 15-20TB in size and grows by 10-15GB per day. I do not need to query the data much (I will do some data exploration on my own, with maybe 1-2 users at a time, but it definitely won't power any large-scale multi-user application). So the main purpose is Anomaly Detection, and I will be the only one managing it.
To begin with, I am planning to use just the static 15-20TB dataset. So:
- Will the 50GB-per-shard ratio still apply in this case? (Rough shard math in the sketch after this list.)
- How many data nodes should I use?
- How many machine learning nodes should I use?
If I then let the dataset grow at the rate mentioned above, will these numbers change much?
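
For context, here is the back-of-envelope shard math behind my first question. This is only a minimal sketch assuming the ~50GB-per-shard guideline holds as-is; the `shards_needed` helper and the 1024GB/TB conversion are just mine for illustration, not anything from the docs:

```python
import math

def shards_needed(total_gb: float, shard_gb: float = 50.0) -> int:
    """Primary shards needed to hold total_gb at the assumed shard_gb per shard."""
    return math.ceil(total_gb / shard_gb)

TB = 1024  # GB per TB

# Static dataset today: 15-20TB -> roughly 308-410 primary shards
print(shards_needed(15 * TB), shards_needed(20 * TB))

# Growth of 10-15GB/day -> 3,650-5,475GB/year -> ~73-110 extra shards per year
print(shards_needed(10 * 365), shards_needed(15 * 365))
```

So even the static dataset would already be in the low hundreds of primary shards, which is exactly why I am unsure how many data nodes (and machine learning nodes) that implies.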