I am using the Logstash Http input plugin to receive logging data from an API gateway (Apigee) and then send it on to various S3 buckets. There are six Logstash nodes (m5.large, 2 CPU/8 GB) sitting behind an AWS Application Load Balancer, processing about 20,000 calls/second in aggregate. Normally this works pretty well: the instances chug along at less than 50% CPU with no apparent networking or memory bottlenecks (although I could easily be wrong about that).
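For reference, each node's pipeline is roughly the shape below (ports, region, and bucket name are placeholders here, not my actual values):

```
input {
  http {
    port => 8080                 # target port the ALB forwards to (placeholder)
  }
}

output {
  s3 {
    region => "us-east-1"        # placeholder
    bucket => "my-log-bucket"    # placeholder; actual routing goes to several buckets
    codec  => "json_lines"
  }
}
```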
However, I get a smattering of 429 responses that usually correspond to the busiest time of day, as shown in the CloudWatch ELB 4XX count graph below. I do have ALB logging enabled and can confirm these are indeed ALL 429 (Too Many Requests) errors.
Analyzing the ALB logs shows that these errors are NOT evenly distributed across the nodes: usually one or two of the nodes account for the majority of them in any 24-hour period. But which nodes those are varies across all six, so no single node always has the problem.
I initially tried enabling a Logstash persistent queue on a single node to see how that worked. However, it made the number of 429s on that particular node MUCH worse. Unfortunately the upstream sender is fire-and-forget, so I don't know whether I am losing data there or not.
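For anyone curious, this is roughly what I set in `logstash.yml` on the test node to enable the persistent queue (the size and path are just what I tried, not a recommendation):

```
# logstash.yml on the single test node
queue.type: persisted
queue.max_bytes: 4gb              # value I tried; the default is 1024mb
path.queue: /var/lib/logstash/queue   # placeholder path
```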
I then looked at the `max_pending_requests` parameter on the http input plugin to see if raising it would help, and raised it on a single node from the default of 200 to 400. Unfortunately that didn't seem to make any difference. There isn't much guidance on how high that value can safely go: the docs warn that raising it can cause memory pressure and to be careful, but I don't know whether "careful" means 250 requests or 2,500 requests. I'm now trying 800 to see what happens.
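For concreteness, this is the shape of the change I'm testing on that node (I've also seen `threads` mentioned as a related knob, which I believe defaults to the CPU core count, though I haven't tuned it yet):

```
input {
  http {
    port => 8080                  # placeholder
    threads => 2                  # defaults to number of CPU cores, I believe
    max_pending_requests => 800   # currently testing; default is 200
  }
}
```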
Anyway, if folks have any guidance, it would be much appreciated. For now I'll continue to tweak parameters here and there and analyze the ALB logs to see whether the node I changed throws fewer errors.