Hi everyone, this is my first post here. I am planning to deploy a cluster which will receive around 2.5 TB of data per day. These logs will be parsed locally on a few Logstash instances and then sent to AWS, which will host the Elasticsearch instances. Right now, I have thought of the configuration below:
It's basically 3x master and 4x data nodes on c4 and c2 xlarge instances, with 19 x 16.3 TB of SSD. We will be using S3 since the log retention period is 3 months. Our plan is to store one month of logs on the SSDs for evaluation and the other 2 months of raw logs, which comes to approx. 80 TB, in S3 buckets, so that we can index them as and when we need to using Lambda. We will be using Curator to delete older data.
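For illustration, this is roughly the kind of retention clean-up we have in mind; a minimal sketch using the elasticsearch-py client rather than Curator's action files, assuming daily indices named logstash-YYYY.MM.dd (the endpoint, index pattern and 30-day cutoff are placeholders):

```python
# Rough sketch of a retention job: delete daily indices older than a cutoff.
# Assumes indices are named logstash-YYYY.MM.dd; endpoint and cutoff are placeholders.
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch

ES_HOST = "http://localhost:9200"   # placeholder endpoint
RETENTION_DAYS = 30                 # keep roughly one month on the hot SSD tier

es = Elasticsearch(ES_HOST)
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

for name in es.indices.get(index="logstash-*"):
    try:
        day = datetime.strptime(name, "logstash-%Y.%m.%d").replace(tzinfo=timezone.utc)
    except ValueError:
        continue  # skip indices that don't follow the daily naming scheme
    if day < cutoff:
        print(f"deleting {name}")
        es.indices.delete(index=name)
```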
I need to understand whether the above hardware configuration would be fine. If not, what would the CPU and memory requirements be?
Hi warkolm,
There was a small error: it's 1.5 TB of logs per day, and the log volume will increase to 2.5 TB within a month. Thus, I am starting off with the 4 data nodes mentioned above, which we may have to increase. I have normally tested on smaller nodes, but we will be moving to production soon. We will have an EPS count of approximately 60,000, and Kibana will send around 200-250 correlation requests per day.
So, what would be the recommended configuration for that log volume and EPS count?
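For context, a quick back-of-the-envelope check of those numbers (decimal TB assumed; the indexed size on disk will of course differ from the raw event size):

```python
# Back-of-the-envelope check: how does 60k EPS relate to 2.5 TB/day of raw logs?
EPS = 60_000             # events per second, from the figures above
DAILY_BYTES = 2.5e12     # 2.5 TB/day, decimal TB

events_per_day = EPS * 86_400
avg_event_size = DAILY_BYTES / events_per_day

print(f"{events_per_day:,.0f} events/day")            # ~5.2 billion
print(f"~{avg_event_size:.0f} bytes per raw event")   # ~480 bytes
```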
The amount of data a single node can handle will often in the end be limited by how much heap it has available, as each shard comes with some overhead in terms of memory usage. 4 data nodes sounds like very little for that amount of data and retention period, so I would recommend running a benchmark to find out how much data one of your nodes can hold while still having enough heap available to serve queries. This process is discussed in this Elastic{ON} talk.
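As a concrete example, this is a minimal sketch of the kind of heap check you could run against the cluster while the benchmark load is going, using the elasticsearch-py client and the nodes stats API (the endpoint and polling interval are placeholders):

```python
# Poll per-node JVM heap usage while a benchmark is running, to see at what
# data volume heap pressure starts to become the limiting factor.
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

while True:
    stats = es.nodes.stats(metric="jvm")
    for node in stats["nodes"].values():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print(f'{node["name"]}: heap {heap_pct}% used')
    time.sleep(30)
```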
We are using ES as a search engine without Kibana, where our queries are around 400-500 per day via normal API calls from a website. We have a volume of around 1 TB per day and a 2-month retention period. When we started benchmarking the cluster we used the default shard configuration, and we are still using the same. Our final cluster size is 3 master nodes and 12 data nodes, each with 64 GB of RAM and an 8-core processor. We started with normal spinning disks at around 10k rpm, then 15k, but they were all very sluggish and we had bad response times. We then moved to EC2 gp2 SSDs, and it is currently working well. We tried auto scaling the nodes, but it's the worst possible thing we did, and I would never recommend it. Instead, we have a dedicated 2-person team who consistently monitor the cluster health, and we add nodes as and when our data changes, then move data to the new node manually and delete the old data safely later.
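For reference, a rough sketch of how the manual "drain a node, then remove it" step can be done with allocation filtering; the node name and endpoint below are placeholders, and this is just one way of doing it:

```python
# Sketch of draining a data node before decommissioning it: exclude it via
# allocation filtering, then wait until it no longer holds any shards.
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
NODE_TO_REMOVE = "data-node-07"              # placeholder node name

# Ask the cluster to move all shards off the node.
es.cluster.put_settings(body={
    "transient": {"cluster.routing.allocation.exclude._name": NODE_TO_REMOVE}
})

# Wait until the node holds no shards; only then is it safe to shut it down.
while True:
    shards = es.cat.shards(format="json")
    remaining = [s for s in shards if s.get("node") and NODE_TO_REMOVE in s["node"]]
    if not remaining:
        print("node is empty, safe to stop it")
        break
    print(f"{len(remaining)} shards still on the node, waiting...")
    time.sleep(60)
```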
We don't have any issues with the health of ES; it is normally green, going to a yellow state maybe 2-3 times a month. When I said monitoring team, I meant that we don't use autoscaling for handling ES; we do it manually. And what other, more efficient means did you have in mind? Can you explain?
Using auto scaling when nodes hold a lot of data is generally a bad idea as this will result in a lot of data being transferred, adding extra load, at exactly the wrong time. I have seen it used, but generally for search use cases with small data volumes and high query rates where all nodes hold all data.
It looks like you have 60 TB of gp2 EBS storage in total and 12 r4.2xlarge instances as data nodes.
You could immediately save more than $1,200/month by moving to i3.2xlarge instances, which have the same amount of memory and CPU but also include fast NVMe-based ephemeral storage.
If you subtract the price of an r4.2xlarge from that of an i3.2xlarge, the difference essentially buys you the 1.9 TB of bundled NVMe storage, at almost half the per-GB cost of gp2 EBS.
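To make that arithmetic concrete, here is a small sketch with placeholder prices; the hourly and per-GB figures below are illustrative only, so plug in the current rates for your region:

```python
# Illustrative comparison: effective per-GB cost of the i3's bundled NVMe
# storage vs gp2 EBS. All prices are placeholders, not quoted AWS rates.
HOURS_PER_MONTH = 730

price_r4_2xlarge_hr = 0.53   # placeholder $/hr
price_i3_2xlarge_hr = 0.62   # placeholder $/hr
price_gp2_gb_month = 0.10    # placeholder $/GB-month

nvme_gb = 1900               # i3.2xlarge ships with ~1.9 TB of NVMe

premium_per_month = (price_i3_2xlarge_hr - price_r4_2xlarge_hr) * HOURS_PER_MONTH
nvme_cost_per_gb = premium_per_month / nvme_gb

print(f"i3 premium: ~${premium_per_month:.0f}/month per node")
print(f"effective NVMe cost: ~${nvme_cost_per_gb:.3f}/GB-month "
      f"vs ${price_gp2_gb_month:.2f}/GB-month for gp2")
```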
Additionally, you should gain some performance improvement (just make sure to use ENA-enabled AMIs).
Just my 2 cents...