I’m using AWS Elasticsearch and need advice on which instance types to choose for my cluster and how many shards to use. My use case is the following:
My system produces at most 150,000,000 records per day (i.e. about 1,736 records per second; each record is ~900 bytes, so about 1.5 MB per second). My system will publish records to ES in batches of 5 MB (since 5000/1500 = 3.3 sec, every 3.3 seconds there will be a batch request to ES with ~6,000 records).
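The arithmetic above can be checked with a short sketch. It assumes decimal megabytes (5 MB = 5,000,000 bytes) and a perfectly even write rate across the day, neither of which is stated in the post; the function name is only illustrative. Without rounding the rate down to 1.5 MB/s, the figures come out at about 3.2 seconds between batches and roughly 5,600 records per batch, close to the estimates quoted.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_profile(records_per_day, record_bytes, batch_bytes):
    """Derive per-second and per-batch figures from daily totals.

    Assumes the load is spread evenly over the day (an assumption,
    real traffic is usually burstier than this).
    """
    records_per_sec = records_per_day / SECONDS_PER_DAY
    bytes_per_sec = records_per_sec * record_bytes
    return {
        "records_per_sec": records_per_sec,
        "mb_per_sec": bytes_per_sec / 1e6,
        "batch_interval_sec": batch_bytes / bytes_per_sec,
        "records_per_batch": batch_bytes / record_bytes,
    }


# Figures from the post: 150M records/day, ~900 bytes each, 5 MB batches.
profile = ingest_profile(150_000_000, 900, 5_000_000)
for name, value in profile.items():
    print(f"{name}: {value:.1f}")
```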
The number of times this data will be read from ES in Kibana is relatively small (for example, maybe 2-3 times a day). So, for me, the most important part is selecting EC2 instances for my nodes that have a lot of storage and can handle a lot of write operations.
I need to figure out how many instances I need and what instance type will fit my needs. I also need to understand how many shards I need when creating an index.
I ran a load test for the amount of data I described above, having just 3 m4.large.elasticsearch
The size of the cluster is often driven by a combination of ingest volume and storage needs. As you have not outlined how long you need to keep your data, it is hard to provide any sizing guidance. How much data a node can ingest and hold will depend on the data, the querying and how well you optimise your indices. Make sure you follow the guidelines around sharding in this blog post and also watch this webinar.
Even though you are primarily indexing into the cluster, you will also need to take querying into account as I assume you still have some performance expectations when you actually do query the data.
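As a rough illustration of how shard count follows from index size, here is a minimal sketch. It relies on a commonly cited rule of thumb of keeping each shard in the tens of gigabytes; the 30 GB target and the function name are assumptions for illustration, not figures from this thread, and on-disk index size can differ noticeably from raw data size depending on mappings and compression.

```python
import math


def primary_shard_count(index_size_gb, target_shard_gb=30.0):
    """Estimate primary shards so each stays near target_shard_gb.

    target_shard_gb is an assumed rule-of-thumb value, not a hard limit.
    """
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


# Raw daily volume at the stated peak: 150M records * 900 bytes = 135 GB.
peak_daily_gb = 150_000_000 * 900 / 1e9
print(primary_shard_count(peak_daily_gb))  # shards for one daily index
```

A smaller daily volume would fit in a single primary shard under the same assumption, in which case weekly or monthly indices may be a better fit than daily ones.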
I need to keep the data for a long time (at least a year). I calculated that 13 GB of data per day will give me roughly 4.7 TB of data per year. I need to make sure that I will have enough storage. As I mentioned, I don't need to run queries on a regular basis, just sometimes. Yes, I want my queries to be quick, but for me right now the most important part is writing to the cluster and storing the data for a long time, rather than querying it.
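The yearly storage figure can be sketched as below. The replica count is an assumption (the thread does not mention replicas), and the estimate uses raw data size rather than actual on-disk index size. Note that 13 GB per day is well below what the stated peak of 150,000,000 records at ~900 bytes would produce (about 135 GB per day), so the sketch shows both cases; presumably the 13 GB figure reflects an average rather than the maximum.

```python
def retained_storage_tb(gb_per_day, retention_days=365, replicas=1):
    """Raw storage needed to retain data, in decimal terabytes.

    replicas is an assumed setting: each replica stores a full
    extra copy of the primary data.
    """
    primary_gb = gb_per_day * retention_days
    return primary_gb * (1 + replicas) / 1000


# Daily volume quoted in the post, primaries only and with one replica.
quoted_primary_tb = retained_storage_tb(13, replicas=0)
quoted_with_replica_tb = retained_storage_tb(13, replicas=1)

# Daily volume implied by the stated peak record rate and record size.
peak_daily_gb = 150_000_000 * 900 / 1e9
peak_primary_tb = retained_storage_tb(peak_daily_gb, replicas=0)

print(f"13 GB/day, no replicas: {quoted_primary_tb:.1f} TB/year")
print(f"13 GB/day, 1 replica:   {quoted_with_replica_tb:.1f} TB/year")
print(f"peak rate, no replicas: {peak_primary_tb:.1f} TB/year")
```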