Choosing the Correct AWS EC2 Servers

Hello everyone. I'm in the process of creating my newest cluster, which will eventually be the basis for my production cluster. My question: what types of servers would you recommend for the master nodes, data nodes, and HTTP (coordinating) nodes?

Some stats about the cluster: 3 master nodes (2 in availability zone A and 1 in availability zone B), 6-8 data nodes (half in each availability zone, with the ability to scale horizontally into each zone as the data increases), and 1 or 2 HTTP nodes (coordinating nodes; not completely sure if these will be needed).
The primary function of this cluster will be to hold one index that is currently scoped at 16-20 terabytes, or 32-40 terabytes with one replica (to ensure there is a full copy of the data in each availability zone). I am scoping the primary shards to contain around 50gb each once the full amount of data is loaded, as the soft limit (from what I've heard) is 50-8-gb per primary shard.
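As a rough sanity check on the figures above, the arithmetic can be sketched in Python. This is only a back-of-the-envelope estimate: it assumes decimal units (1 TB = 1000 GB), perfectly even shard distribution, and no overhead for merges, translog, or free-disk headroom. The function names are made up for illustration.

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    return math.ceil(index_size_gb / target_shard_gb)


def per_node_storage_gb(index_size_gb: float, replicas: int, data_nodes: int) -> float:
    """Data each node holds if primaries and replicas spread evenly (an assumption)."""
    total_gb = index_size_gb * (1 + replicas)
    return total_gb / data_nodes


if __name__ == "__main__":
    for size_tb in (16, 20):
        size_gb = size_tb * 1000  # assumption: decimal terabytes
        shards = primary_shard_count(size_gb, 50)
        for nodes in (6, 8):
            per_node = per_node_storage_gb(size_gb, replicas=1, data_nodes=nodes)
            print(f"{size_tb} TB index: {shards} primaries, "
                  f"{nodes} data nodes -> ~{per_node:.0f} GB per node")
```

Under these assumptions, each data node would need to hold several terabytes before any headroom is added, which is worth keeping in mind when comparing instance types by local storage.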

The documents themselves are pretty massive, around 2-5 MB per document, I believe. Any answer can be as specific or general as you like. I'm just looking to know which server types (of those available through AWS) would be recommended for both the data nodes and master nodes.

If desired, I can attach a basic Visio diagram of what this cluster looks like. Thanks for any/all help!

BUMP

Choosing the right shard size, node count and specification depends a lot on the data, indexing load as well as query types, volume and latency requirements. I suspect the given information is insufficient for anyone to give an answer and that you will need to perform some testing and benchmarks to get a good and reliable answer. Have a look at this talk about cluster sizing for some guidance.

The problem I currently have with AWS is that there's no way to grow local storage without expanding horizontally. EBS is just not able to handle queries with the amount of data you and I have. Plus, EBS is pretty costly compared to local storage. If you have to use AWS, make sure to use instances with NVMe storage and 61 GB of RAM (which doesn't allow for a lot of storage per data node). This lack of local storage has pushed me to configure my own on-premise servers to get as much performance as possible, and to be able to add more memory/storage to a node rather than increase the number of nodes. That works as long as processing power is not completely consumed. I'm still working on the new cluster.

@Christian_Dahlqvist understood.

@morphers82 Thank you, this was very helpful and exactly the kind of advice I was looking for. Without asking you to get into much detail (don't want to waste your time :slight_smile:), would you recommend Memory Optimized servers for data nodes (X1, R4) over the General Purpose (T2, M4) servers? My ES instincts tell me I should choose memory optimized over compute optimized or storage optimized. I'll attach a PDF of my current cluster design. Also, I will be scaling horizontally, rather than vertically, as I add data, to maximize the Java heap (given from each machine). I will be allocating more primary shards than nodes during the initial index creation to take advantage of the extra data nodes that are added. I also find it very interesting that you say EBS is not able to handle queries with that amount of data. Thanks again!

@morphers82, hopefully this is helpful in understanding what this cluster design will be like. I'm currently looking into your comment about NVMe. I'm also trying to use replica shards to ensure a full copy of the data is present in each availability zone.
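One way to get a full copy per availability zone is Elasticsearch's shard allocation awareness. The sketch below only builds and prints the request bodies; it does not contact a cluster. It assumes each data node is started with a custom attribute named `zone` (for example `node.attr.zone: zone-a` in `elasticsearch.yml`). The zone names and the shard count of 320 (simply 16 TB divided by 50 GB) are illustrative placeholders, and the exact settings should be checked against the documentation for the Elasticsearch version in use.

```python
import json

# Hypothetical zone labels; they must match the node.attr.zone value on each node.
ZONES = ["zone-a", "zone-b"]


def awareness_settings(attribute: str, zones: list[str]) -> dict:
    """Body for PUT _cluster/settings enabling (forced) allocation awareness."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": attribute,
            # Forced awareness: if a zone is down, its copies are not
            # re-created in the surviving zone.
            f"cluster.routing.allocation.awareness.force.{attribute}.values": ",".join(zones),
        }
    }


def index_settings(primary_shards: int, replicas: int) -> dict:
    """Body for creating the index with a fixed primary and replica count."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


cluster_body = awareness_settings("zone", ZONES)
index_body = index_settings(primary_shards=320, replicas=1)

if __name__ == "__main__":
    print(json.dumps(cluster_body, indent=2))
    print(json.dumps(index_body, indent=2))
```

With one replica and two zones, awareness should keep a primary and its replica in different zones, so each zone ends up with a complete copy of the index.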

Update: Based on your previous response, I am looking at I3 servers on AWS (Storage Optimized). Also, the data I will be using the cluster for is non-transactional.

[Attached diagram: Drawing4]
