Elasticsearch Cluster size

I have a dataset, on which I would like to perform Anomaly Detection.
The dataset is currently 15-20TB and grows by 10-15GB per day. I do not need to query the data much (I will do some data exploration myself, and perhaps 1-2 users at a time, but it definitely won't power any large-scale multi-user application). So the main purpose is Anomaly Detection, and I will be the only one managing it.

At the beginning, I am planning to use just the static dataset of 15-20TB. So:

  • Will the 50GB-per-shard guideline still apply in this case?
  • How many data nodes should I use?
  • How many machine learning nodes should I use?

If I let the dataset grow at the rate mentioned above, will the answers to these questions change much?

Thank you

Cluster size calculations depend on many things, including:

  1. Ingest rate per day
  2. Data retention (how long you expect to keep the data around)
  3. Expected search performance (how fast queries need to be)
  4. Node hardware and storage hardware performance

In other words, you could architect a cluster with a low-ish node count on mediocre hardware that still operates decently, or you could architect a right-sized cluster on fast hardware (with SSDs, for example) that would be blazingly fast.
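To make the trade-off concrete, here is a rough back-of-envelope sketch in Python. It assumes the commonly cited ~50GB-per-shard guideline, one replica, and a hypothetical ~8TB of usable disk per data node — all of these are illustrative assumptions, not official sizing guidance, and search/ML workload will shift the real answer:

```python
import math

# Illustrative assumptions -- adjust for your own hardware and workload.
TB = 1024                    # GB per TB
data_gb = 20 * TB            # static dataset of ~20 TB
shard_size_gb = 50           # common ~50 GB-per-shard guideline
replicas = 1                 # one replica copy for resilience
node_disk_gb = 8 * TB        # hypothetical usable disk per data node

primary_shards = math.ceil(data_gb / shard_size_gb)
total_shards = primary_shards * (1 + replicas)
total_storage_gb = data_gb * (1 + replicas)
data_nodes = math.ceil(total_storage_gb / node_disk_gb)

print(f"primary shards: {primary_shards}")
print(f"total shards (with replicas): {total_shards}")
print(f"data nodes (disk-bound estimate): {data_nodes}")
```

Note this only bounds the cluster by disk; heap pressure, query latency targets, and ML job memory would all push the node count up from this floor.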

If you are truly interested in doing Anomaly Detection in a production environment, then note that it is a paid feature. If you are in the market for an Elastic Subscription, then one of our Solutions Architects can also help you size a cluster for your use case. Just contact us (sales@elastic.co) and we're here to help.

If, on the other hand, you are just doing this for academic research (which seems to be the case from your previous posts), then you're going to be a bit on your own on this one. You can get guidance from our blogs - for example:

https://www.elastic.co/blog/sizing-hot-warm-architectures-for-logging-and-metrics-in-the-elasticsearch-service-on-elastic-cloud

Good luck!
