Got in over my head. Basic setup for large Elasticsearch cluster

I have gotten in a bit over my head on an Elasticsearch project and am seeking some very basic architectural information. I am trying to deliver a basic Elasticsearch cluster for some researchers, who will put the cluster under a small load; I imagine fewer than 100 queries per day. New documents will also be added intermittently, perhaps a gigabyte here or there every few weeks or months. I currently have 1 terabyte of data made up of roughly 50 to 75 billion records (the record count is very fuzzy, though).

Based on the info here https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/sizing-domains.html

Minimum Storage Requirement = Source Data * (1 + Number of Replicas) * (1 + IndexingOverhead) / (1 - Linux Reserved Space) / (1 - Amazon ES Overhead)

So for my case:

data_size = 1T
storage = data_size * 2 * 1.1 / 0.95 / 0.8 = 2.89T = ~3T
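
For anyone who wants to check the numbers, here is the same storage arithmetic as a minimal Python sketch. The function and parameter names are my own, and the default values (1 replica, 10% indexing overhead, 5% Linux reserved space, 20% Amazon ES overhead) are simply the figures I plugged into the formula above, so treat them as assumptions rather than fixed constants.

```python
def min_storage_tb(source_tb, replicas=1, indexing_overhead=0.10,
                   linux_reserved=0.05, es_overhead=0.20):
    """Minimum storage per the sizing formula quoted above (all sizes in TB).

    Default values are the assumptions used in this post, not fixed constants.
    """
    return (source_tb
            * (1 + replicas)
            * (1 + indexing_overhead)
            / (1 - linux_reserved)
            / (1 - es_overhead))


storage_tb = min_storage_tb(1.0)
print(f"minimum storage: {storage_tb:.2f} TB")  # prints "minimum storage: 2.89 TB"
```

Changing `replicas` to 0 halves the result, which shows how much of the ~3T is the replica copy.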

Also, for the shard count in my case:

desired_shard_size = 30GB
shards = ~3000GB * 1.1 / desired_shard_size = ~110
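
The shard estimate can be sketched the same way. This mirrors my arithmetic above exactly (starting from the ~3000GB storage figure, with a 10% overhead factor and a 30GB target shard size); the function name is hypothetical, and whether ~3000GB is the right input is part of what I am asking.

```python
def approx_shards(data_gb, overhead=0.10, desired_shard_size_gb=30):
    """Approximate shard count, mirroring the arithmetic in this post.

    data_gb is whatever data size the estimate starts from (an assumption
    of this sketch); the result is rounded to the nearest whole shard.
    """
    return round(data_gb * (1 + overhead) / desired_shard_size_gb)


shards = approx_shards(3000)
print(f"approximate shards: {shards}")  # prints "approximate shards: 110"
```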

Also, the above link says the following:

In general, the storage limits for each instance type map to the
amount of CPU and memory that you might need for light workloads. For
example, an m4.large.elasticsearch instance has a maximum EBS volume
size of 512 GiB, 2 vCPU cores, and 8 GiB of memory.

Since I believe that the cluster falls under a "light" workload, would it be accurate to say that if I chose to use m4.large.elasticsearch, I would want to provision 3T / 512GB = ~6 m4.large.elasticsearch instances?

So in total I would need:

~3T of storage on 110 shards, spread across 6 nodes (EC2 servers).
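
As a rough check on the node count, here is a small Python sketch. It assumes that the only constraint is the 512 GiB maximum EBS volume per m4.large.elasticsearch instance quoted above, treats GB and GiB as interchangeable, and uses names I made up for illustration. It only counts nodes that hold data.

```python
import math


def nodes_for_storage(storage_gb, max_volume_gb=512):
    """Nodes needed if each node can hold at most max_volume_gb of storage."""
    return math.ceil(storage_gb / max_volume_gb)


total_storage_gb = 3000   # ~3T from the storage estimate above
total_shards = 110        # from the shard estimate above

data_nodes = nodes_for_storage(total_storage_gb)
shards_per_node = total_shards / data_nodes

print(f"nodes: {data_nodes}")                      # prints "nodes: 6"
print(f"shards per node: {shards_per_node:.1f}")   # prints "shards per node: 18.3"
```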

I am fuzzy on the different types of nodes, i.e. master nodes, client nodes and data nodes. Do the above node calculations cover all of the nodes I would need? I understand that a "data node" holds data for searching and that a "master node" controls the cluster. Does this mean that I would have 6 data nodes (based on the above calculations) plus some number of master nodes?

Is the above reasoning sound for my use case? Also, are there any recommended resources that you know of to guide a novice user through a setup of the type described above?

Thank you for any and all input.
