Elastic Agent -> Logstash: high EPS, missing events

Hi everyone,

We have a rather large setup in which we’ve replaced rsyslog with Elastic Agents as log collectors. We’re running 8.19.8.

Currently, around 1,200 devices forward mostly syslog to a load balancer, which distributes the logs across 13 Elastic Agents (the policy contains 13 integrations: 11 TCP, 1 UDP, 1 HTTP). The agents in turn forward to a Logstash output (3 Logstash endpoints fronting 12 Logstash nodes).
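For reference, one of the TCP integrations looks roughly like the following in standalone-style agent policy YAML. This is a minimal sketch: the input id, port, and dataset name are made up for illustration, and the exact schema may differ slightly depending on how the integration renders the policy.

```yaml
inputs:
  - type: tcp
    id: tcp-syslog-example   # hypothetical id
    use_output: default
    streams:
      - data_stream:
          dataset: tcp.generic   # hypothetical dataset name
        host: "0.0.0.0:5514"     # hypothetical listener port
```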

"Collect agent metrics" is enabled, and in the built-in Agent metrics dashboards we see some activity in the "events failed rate /s" lens for the metrics streams, but none for the actual logs/TCP integrations. We also don't see any dropped events in the "events dropped rate /s" lens.

However, during daily operations we noticed logs in the legacy solution that we could not find in Elastic.

After tuning on the Logstash side and in the Elastic Agent Logstash output configuration, as well as adding more agents and more resources per agent, our ingest rate went from around 100k EPS to 175-200k EPS, which suggests we were previously losing roughly half of our log volume somewhere.

What we observed then, and can still observe now, is that all Elastic Agents show 100% queue utilization most of the time (with the current setup the queue at least drains a few times per day).

Looking at all the metrics we have, Elastic Agent is the most likely culprit. I couldn't find much information on sizing and queue optimization for a setup processing this much volume, so my hope is that the community can give me some pointers on whether our sizing makes sense and what we could tweak to improve performance.

Our Logstash output settings look as follows:

worker: 16
bulk_max_size: 4096
queue.mem.events: 131072
queue.mem.flush.min_events: 4096
queue.mem.flush.timeout: 0s
compression_level: 1
pipelining: 4
loadbalance: true
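One way to sanity-check these numbers: with the beats-style output semantics, up to worker x pipelining batches of bulk_max_size events can be in flight per Logstash host, and that ceiling here is larger than the memory queue itself. The sketch below spells out the arithmetic as comments; the hostnames are placeholders, and the "per host" interpretation of worker and the rule of thumb at the end are our own assumptions rather than documented guidance.

```yaml
outputs:
  default:
    type: logstash
    hosts: ["ls-lb1:5044", "ls-lb2:5044", "ls-lb3:5044"]  # hypothetical endpoints
    loadbalance: true
    worker: 16            # workers per configured host (beats semantics, assumed)
    pipelining: 4         # async batches in flight per connection
    bulk_max_size: 4096   # events per batch
    compression_level: 1
    queue.mem.events: 131072          # queue holds 131,072 events
    queue.mem.flush.min_events: 4096
    queue.mem.flush.timeout: 0s
    # Rough capacity check (our own arithmetic, assuming the semantics above):
    #   in-flight ceiling = worker x hosts x pipelining x bulk_max_size
    #                     = 16 x 3 x 4 x 4096 = 786,432 events
    #   i.e. several times queue.mem.events (131,072), so the queue can never
    #   feed all workers full batches at once.
    # Per-agent load check: ~200k EPS / 13 agents is roughly 15k EPS per agent,
    #   so a 131,072-event queue buffers under 10 seconds of a Logstash stall.
```

If that interpretation holds, raising queue.mem.events (memory permitting) so the queue is at least a small multiple of worker x bulk_max_size per host would be a reasonable experiment, though the 100% utilization readings mainly indicate the downstream (Logstash) side is not draining fast enough.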

After setting the flush timeout to 0s we saw a significant uptick in EPS, hence we left it at that value.

The containers running Elastic Agent are assigned 8 CPU cores and 16 GB of memory (which looks oversized judging by the metrics charts).

I'd love to hear from people with experience running similarly sized setups: what could we improve to stabilize throughput and ensure we don't lose any logs, while still not going overboard on resources?