I'm currently running our infrastructure on ECS Fargate with logs going into CloudWatch. I've been less than happy with CloudWatch as a log tool and have previously used an ELK stack for smaller setups. We also have a set of servers running purely on EC2 + ASG, and I would like to consolidate everything under ECS. However, these services require eBPF tracing capabilities, so I'm basically forced onto ECS on EC2, which would mean moving all our services there.
Moving away from Fargate to a full EC2 setup allows rethinking how we handle logs. What I would like to do is something along the following lines:
- Use either the fluentd or syslog log driver for Docker on the EC2 instances to collect logs.
- Ship logs from the EC2 instances into a stream (Redis or similar) using rsyslog/syslog-ng/fluentd.
- Have Logstash pull logs out of the stream.
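To make the first step concrete, here's roughly what I have in mind for the Docker side, using the fluentd log driver set instance-wide in daemon.json. The address assumes a local fluentd/fluent-bit agent on each instance; the port and tag are just placeholders:

```json
{
  "log-driver": "fluentd",
  "log-opts": {
    "fluentd-address": "localhost:24224",
    "fluentd-async": "true",
    "tag": "docker.{{.Name}}"
  }
}
```

With `fluentd-async` enabled, containers keep running even if the local agent is briefly down, which seems like the right trade-off given we can tolerate some log loss.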
What I'm looking for is experience running such a setup under significant load in a cost-effective manner. We currently handle about 500K RPS, which is the rate at which access logs alone are generated; on top of that, some other log events are emitted per request.
My question is therefore: which event broker/stream would handle this well, and how should Logstash collection from it be set up so that the system can scale? The other issue is cost. With enough money thrown at the problem, Kafka would work very well here (and provide durability guarantees), but an in-memory streaming system with enough replicas should be fine: critical events are also recorded outside the logs in our data warehouse, so losing logs now and then would mostly just be annoying for debugging.
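If Redis turns out to be the broker, the Logstash side would presumably look something like the following sketch using the redis input plugin (hostnames, key name, and tuning numbers are placeholders I'd expect to adjust during load testing):

```
input {
  redis {
    host        => "redis.internal"   # placeholder hostname
    data_type   => "list"             # BLPOP-style consumption from a list
    key         => "logs"             # list key the shippers push to
    threads     => 4                  # parallel consumers per Logstash node
    batch_count => 125                # events fetched per round-trip
  }
}
output {
  elasticsearch {
    hosts => ["http://es.internal:9200"]  # placeholder hostname
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```

Consuming from a list means multiple Logstash instances can pull from the same key and the work gets distributed naturally, which is the scale-out property I'm after; whether one Redis node keeps up at this rate is exactly what I'd want the test cluster to answer.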
Looking for a starting point on which components to try, and whether anyone has learnings to share. I'll be setting up a test cluster in the coming weeks to do proper load testing.