I'm working on building a scaled-out ELK setup for a production system in AWS on EC2 instances. I need it to handle about 175 GB of logs a day, give or take 50 GB, which I estimate averages out to roughly 2,750 events/s. The logs come from multiple app servers running on EC2 instances with an aggressive log-rotation schedule during production hours; incremental backup logs are created when the rotation occurs. I would use Filebeat to read from the backup logs, but due to the nature of the app, information might be missed in the event of a failure.
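For concreteness, this is roughly the Filebeat config I have in mind on the app instances (Filebeat 5.x syntax; the log path, multiline pattern, and hostnames are placeholders, since the real format is app-specific):

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/myapp/app.log*   # placeholder; glob also matches the rotated backup files
    # stitch multi-line events (e.g. stack traces) together before shipping
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'   # assumes events start with a date
    multiline.negate: true
    multiline.match: after
    scan_frequency: 1s   # down from the 10s default, to pick up new backup files quickly

output.logstash:
  hosts: ["logstash-1:5044", "logstash-2:5044"]   # placeholder hostnames
  loadbalance: true
```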
My plan is to use Filebeat to run the multiline filter (as in the sketch above) and then ship the logs from the app instances to (2) Logstash instances [m4.large = 2 CPU / 8 GB RAM]. There are only a couple of simple grok filters that need to run: one mainly for the timestamp and another for a specific value. From there, the output goes to (4) Elasticsearch instances [m4.large], each with a 500 GB EBS volume of standard SSD. The logs are only kept for a week, so Curator will run as a cron job to clear them out after that.
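The Logstash pipeline would look something like this sketch; the grok pattern is a stand-in, since the actual fields depend on our log format:

```
input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    # placeholder pattern; assumes "<timestamp> <level> <message>"-style lines
    match => { "message" => "%{TIMESTAMP_ISO8601:log_timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
  date {
    # index by the event's own timestamp rather than ingest time
    match => ["log_timestamp", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["es-node-1:9200", "es-node-2:9200", "es-node-3:9200", "es-node-4:9200"]  # placeholders
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

Daily indices should also make the Curator cleanup a simple delete of anything older than 7 days.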
For Elasticsearch, I'm planning on (3) of the nodes being master-eligible/data and (1) node being data-only. I would also probably run Kibana on one of the instances, or perhaps on a separate t2 instance, since I don't think it uses much in the way of resources. I'm thinking of running with 5 shards and 1 replica for ES. For host discovery I was going to use "discovery.zen.ping.unicast.hosts", which would include all 4 nodes.
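A sketch of the elasticsearch.yml I'd expect on each node (hostnames are placeholders). As far as I know, 5 shards and 1 replica are already the out-of-the-box index defaults, so that part shouldn't need any change:

```yaml
# on the three master-eligible/data nodes
cluster.name: app-logs
node.name: es-node-1
node.master: true
node.data: true

discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3", "es-node-4"]
# quorum of master-eligible nodes: (3 / 2) + 1 = 2, to avoid split brain
discovery.zen.minimum_master_nodes: 2

# on the data-only node, flip the roles instead:
# node.master: false
# node.data: true
```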
Since the log rotation on the app instances is so aggressive, I feel there is a risk of missing logs if the Logstash pipeline backs up. In that case I would add two Redis instances as a buffer between Filebeat and Logstash, but I'm going to try it without them first.
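If I do end up needing the buffer, my understanding is the change would look roughly like this (placeholder hostnames; Filebeat writes to a Redis list and Logstash pops events off it):

```
# filebeat.yml on the app instances (replaces output.logstash):
output.redis:
  hosts: ["redis-1:6379", "redis-2:6379"]
  key: "filebeat"
  loadbalance: true

# Logstash pipeline input (replaces the beats input; one input block per Redis instance):
input {
  redis {
    host => "redis-1"
    data_type => "list"
    key => "filebeat"
  }
}
```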
With (4) ES instances of 500 GB each and 1 replica enabled, does that mean I effectively have 1 TB of usable storage, since the other 1 TB would be taken up by the replicas?
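The back-of-the-envelope math behind my question (treating indexed size as roughly equal to raw log size, which I know isn't exact):

```
raw capacity:       4 nodes x 500 GB               = 2 TB
usable primaries:   2 TB / (1 primary + 1 replica) = ~1 TB
7-day retention:    7 days x ~175 GB/day           = ~1.2 TB raw
```

If that's right, a week of logs at the high end might not fit once replicas and ES overhead are counted, which is part of why I'm asking.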
Any recommendations on the setup or areas of concern?