Need help with hardware configuration for an ELK stack ingesting 2.5 TB of data per day

Hi everyone, this is my first post here. I am planning to deploy a cluster which will receive around 2.5 TB of data per day. These logs will be parsed locally on a few Logstash instances and sent to AWS, which will host the Elasticsearch instances. Right now, I have the following configuration in mind:

It's basically 3x master and 4x data nodes (c4 and c2 xlarge), with 19 x 16.3 TB of SSD. We will be using S3 since the log retention period is 3 months. Our plan is to store one month of logs on the SSDs for evaluation, plus 2 months of raw logs (approx. 80 TB) in S3 buckets, so that we can index them as and when we need them using Lambda. We will be using Curator to delete older data.

I need to understand whether the above hardware configuration would be fine. If not, what would the CPU and memory requirements be?
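For what it's worth, a back-of-the-envelope calculation for the hot (SSD) tier can help frame the question. A minimal sketch, where the index-overhead factor and replica count are assumptions for illustration (measure your own on-disk ratio on a sample of real logs):

```python
# Rough hot-tier storage sizing. All inputs are illustrative assumptions;
# replace them with measurements from your own data.

daily_raw_tb = 2.5          # raw log volume per day (from the post above)
index_overhead = 1.1        # assumed indexed-size vs. raw-size ratio
replicas = 1                # assumed: one replica copy per primary shard
hot_retention_days = 30     # one month kept on SSD

hot_tier_tb = daily_raw_tb * index_overhead * (1 + replicas) * hot_retention_days
print(f"Hot tier needs roughly {hot_tier_tb:.0f} TB of SSD")
```

With these assumed factors that comes out around 165 TB, well under the ~310 TB (19 x 16.3 TB) being provisioned, which suggests the binding constraint is more likely CPU and heap than raw disk.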

That's quite a lot for not many nodes. What testing have you done?

Hi warkolm,
there was a small error. It's 1.5 TB of logs per day. The log size will increase to 2.5 TB within a month. Thus, I am starting off with the 4 data nodes mentioned above, which we may have to increase. I have normally tested on smaller nodes, but we will be moving to production soon. We will have an EPS count of approx. 60,000. Kibana will send around 200-250 correlation requests per day.

So, what should be the recommended configuration for such log size and such EPS count?

The amount of data a single node can handle will often, in the end, be limited by how much heap it has available, as each shard comes with some overhead in terms of memory usage. 4 data nodes sounds like very little for that amount of data and retention period, so I would recommend running a benchmark to find out how much data one of your nodes can take while still having enough heap available to serve queries. This process is discussed in this Elastic{ON} talk.
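To see why per-shard heap overhead matters here, a quick shard-count estimate helps. The per-index shard settings below are assumptions (5 primaries was a common default in older Elasticsearch versions); substitute your actual index template values:

```python
# Rough shard-count estimate for the hot tier, illustrating why heap
# overhead per shard becomes the limit. Settings are assumptions.

daily_indices = 1        # assumed: one time-based index per day
primary_shards = 5       # assumed: default primary count in older ES
replicas = 1             # assumed: one replica per primary
retention_days = 30      # one month on the hot tier
data_nodes = 4           # from the proposed configuration

total_shards = daily_indices * primary_shards * (1 + replicas) * retention_days
shards_per_node = total_shards / data_nodes
print(f"{total_shards} shards total, ~{shards_per_node:.0f} per data node")
```

Each of those shards consumes heap even when idle, which is exactly the overhead the benchmark in the linked talk is meant to expose.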

We are using ES as a search engine without Kibana; our queries are around 400-500 per day via normal API calls from a website. We ingest around 1 TB per day with a 2-month retention period. When we started benchmarking the cluster, we used the default shard configuration, and we are still using the same. Our final cluster size is 3 master nodes and 12 data nodes, each with 64 GB of RAM and an 8-core processor. We started with normal spinning disks at around 10k RPM, then 15k, but they were all very slow and we had bad response times. We then moved to EC2 gp2 SSDs, which are currently working well. We tried scaling the nodes, but it was the worst possible thing we did, and I would never recommend it. Instead, we have a dedicated 2-person team who continuously monitor the health, and we add nodes as and when our data grows, move data onto the new node manually, and safely delete the old copies later.

Elasticsearch doesn't need full-time attention like this; it sounds like you have a few problems that might be solved by other, more efficient means?

Hi Joseph,
With scaling, did you mean EC2 scaling? If yes, what issues did you face doing that?

We don't have any issues with the health of ES. It is normally green, going to yellow maybe 2-3 times a month. When I said monitoring team, I meant we don't use autoscaling for handling ES; we do it manually. And what other, more efficient means did you have in mind? Can you explain?

Why do you need two people to do this?

They monitor not only ES health but also handle management of the clusters. We don't use support from our vendor.

We lost data with scaling, and our cluster went straight to red state since it was unable to find the shards that were on the scaled node.

Hi Warkolm,
What is your experience on auto scaling of nodes under pressure in AWS? Any ideas?

It's not something I have done, I don't run clusters these days and haven't run production level ones for a while.

But we have users doing this and it works fine.

Hi warkolm,
Any ideas on how to downsize a node once the peak pressure is gone when autoscaling?

I wouldn't, I would horizontally scale. Both up and down.

Yes, what I meant was: how do you take down a node after the peak pressure is gone, since the new node will contain shards and indices too?

Using auto scaling when nodes hold a lot of data is generally a bad idea as this will result in a lot of data being transferred, adding extra load, at exactly the wrong time. I have seen it used, but generally for search use cases with small data volumes and high query rates where all nodes hold all data.
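For completeness, one common way to drain a node before terminating it is shard allocation filtering via the `cluster.routing.allocation.exclude._ip` cluster setting: Elasticsearch then relocates that node's shards elsewhere, and once `_cat/shards` shows nothing left on it, the node can be stopped safely. A minimal sketch that only builds the request (the IP and endpoint are hypothetical; run the printed command against your own cluster):

```python
import json

# Sketch: drain a data node before decommissioning it, using shard
# allocation filtering. This builds and prints the API call rather than
# executing it; the node IP below is a hypothetical placeholder.

node_ip = "10.0.0.42"  # hypothetical IP of the node being removed
body = {
    "transient": {
        "cluster.routing.allocation.exclude._ip": node_ip
    }
}
print("curl -XPUT 'http://localhost:9200/_cluster/settings' "
      "-H 'Content-Type: application/json' -d '%s'" % json.dumps(body))
# After shards finish relocating (check GET _cat/shards), stop the node,
# then clear the exclusion by setting the value back to null.
```

Note this relocation is exactly the "lot of data being transferred at the wrong time" cost mentioned above, which is why it works better as a deliberate manual step than as an autoscaling trigger.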

Thanks Christian.

It looks like you have in total 60 TB of gp2 EBS storage and 12x r4.2xlarge instances for data nodes.
You could immediately achieve a more than $1,200/mo saving by moving to i3.2xlarge instances, which have the same amount of memory and CPU but also include fast NVMe-based ephemeral storage.

If you subtract the price of an r4.2xlarge from an i3.2xlarge, the difference is essentially the price of 1.9 TB of storage, which works out almost 2 times cheaper than EBS gp2.
Additionally, you should gain some performance improvement (just make sure to use ENA-enabled AMIs).
Just my 2 cents...
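The per-node argument above can be sanity-checked with rough numbers. The hourly and per-GB rates below are assumed, illustrative on-demand figures, not current AWS prices; look up your region's actual pricing:

```python
# Per-node sanity check of the i3-vs-r4+gp2 comparison. All prices are
# assumed illustrative rates -- check current AWS pricing before deciding.

r4_hourly = 0.532            # assumed r4.2xlarge on-demand rate, USD/hr
i3_hourly = 0.624            # assumed i3.2xlarge on-demand rate, USD/hr
hours_per_month = 730
gp2_per_gb_month = 0.10      # assumed gp2 price, USD per GB-month
nvme_tb = 1.9                # local NVMe included with an i3.2xlarge

# Effective monthly cost of the NVMe, i.e. the instance price premium:
nvme_effective = (i3_hourly - r4_hourly) * hours_per_month
# What the same 1.9 TB would cost on gp2 EBS:
gp2_equivalent = nvme_tb * 1024 * gp2_per_gb_month

print(f"1.9 TB NVMe via the i3 premium: ~${nvme_effective:.0f}/mo")
print(f"1.9 TB on gp2 EBS:              ~${gp2_equivalent:.0f}/mo")
```

With these assumed rates the NVMe comes out a factor of 2-3 cheaper per TB, in line with the "almost 2 times cheaper" claim. The trade-off is that ephemeral NVMe is lost when the instance stops, so replicas (or snapshots to S3) become mandatory rather than optional.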

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.