Elastic Fleet Agent stops processing TCP ingest, UDP continues

Hi All,

I’m running an Elastic Agent (8.19.3) configuration with multiple Generic TCP and UDP inputs, deployed on Ubuntu Server.
I noticed that yesterday at around 10:30 (Oct 14th) TCP ingest stopped completely, while UDP kept ticking along. After restarting the agent, logs are being processed again.

In the Kibana UI (Fleet, specifically) the reported memory usage was 66G! That’s ever so slightly concerning.

Open handles increased in exactly the same fashion, up to 52,800!

The failed event rate for TCP spiked upwards as soon as memory started increasing, then stayed roughly flat at that level even while memory continued to climb.

This is far from an area I’m confident in, but this definitely looks like some sort of bug. These are not particularly high-throughput logs (around 80/s), so it’s a bit concerning to see this sudden increase, especially considering it had been working fine for a number of weeks prior without any configuration changes.

The setup is definitely far from ideal (the Fleet agent server is not running standalone and has other integrations deployed alongside it; I’m in the process of migrating this), but I have not made any changes to the actual agent configuration outside of the defaults.

Any ideas on how I can troubleshoot this? Of course I’ll continue monitoring the usage over time, but it would be good to be able to determine a root cause. Not looking for anyone to take the troubleshooting workload away from me, just need a pointer on where I should start looking :slight_smile:

Thanks!

Welcome to the forum @essinghigh

There are system limits on open files, and also on available TCP port numbers. If there is a slow leak you will eventually run out of one or the other. Tools like lsof can help you track usage of ports/files.
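
Something along these lines would show whether it's the agent process itself holding the descriptors and connections (the process names and port 9000 below are assumptions; substitute whatever your TCP input actually listens on):

```bash
# Count open file descriptors held by each agent-related process
for pid in $(pgrep -f 'elastic-agent|filebeat'); do
  echo "$pid $(ps -o comm= -p "$pid") fds=$(ls /proc/"$pid"/fd | wc -l)"
done

# List TCP connections on the input port (9000 is a placeholder)
ss -tanp | grep ':9000'

# Or count everything a specific PID has open with lsof
lsof -p <pid> | wc -l
```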

Well, this is sort of what forums like this are for, to help in situations like yours!

Thanks for the quick response!

I understand the system limits, though I’m not sure I could consider a memory increase from ~400M → ~25G in four hours a slow leak. I would assume the open handles should be closing themselves after a sufficient timeframe?

If it were these limits, I would have expected to see this increase starting from when we began ingesting logs over TCP (which is done at a reasonably consistent rate) a few weeks ago… but it seems this dramatic uptick only started yesterday, with no changes made.

One thing I forgot to mention: while running a packet capture and filtering for the TCP port the agent was listening on, I could see syslog messages being sent, and it looked like the box was responding completely normally…
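
In case it’s useful to anyone, the capture was along the lines of the following, with 9000 standing in for the actual listening port:

```bash
# Watch a handful of packets arriving on the agent's TCP input port
# (-A prints the payload so the syslog messages are visible)
sudo tcpdump -i any -nn -A -c 50 'tcp port 9000'
```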

It doesn’t look like I have any logs to get to the bottom of it, so I’ll keep an eye on it for a while and see if it’s a recurring issue.

The point about a slow leak is taken; I had not noted the x-axis values on the graph.

I am speculating that has not happened.

There are two points here: why did the memory usage suddenly start growing linearly, and why did ingest on the TCP port stop processing? I think the latter is because, and it's almost obvious to say it, you hit some limit; memory usage / open handles cannot grow forever.
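
You can also check what the effective limits are for the running agent process, e.g. (the process name lookup below is an assumption about how the agent appears in the process list):

```bash
# Effective resource limits for the running elastic-agent process
pid=$(pgrep -xo elastic-agent)
grep -E 'open files|processes' /proc/"$pid"/limits

# System-wide allocated vs. maximum file handles
cat /proc/sys/fs/file-nr
```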

If you had lsof or ss or netstat outputs from some time during the "memory inflation" period, then I suspect you would see connections and/or open files growing almost linearly. What could cause this? That I don't know, but if you can reproduce it I think it would be a bug. Maybe some network noise, like retransmissions or lost packets, might confuse the agent somehow. "multiple Generic TCP and UDP inputs" is a bit vague, but again, with those diagnostics you would probably know which of these inputs is the one to start with.
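
If it happens again, even a crude snapshot loop left running would give you that data (port 9000, the process name and the output path are placeholders):

```bash
# Every 5 minutes, record established connections on the input port
# and the agent's open file descriptor count
while true; do
  ts=$(date -Is)
  conns=$(ss -tan state established '( sport = :9000 )' | tail -n +2 | wc -l)
  fds=$(ls /proc/"$(pgrep -xo elastic-agent)"/fd 2>/dev/null | wc -l)
  echo "$ts established=$conns agent_fds=$fds" >> /var/tmp/agent-fd-snapshots.log
  sleep 300
done
```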

Another thing you could do is "tap" the specific interface to some other system, and on that sink system save the traffic into a cycling buffer covering the last N hours, allowing you to do traffic analysis later.
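
tcpdump can do the cycling part itself, e.g. (interface, port and path are placeholders):

```bash
# One capture file per hour, named by hour of day, so after 24 hours
# the oldest file is overwritten and you always keep roughly the last day
sudo tcpdump -i eth0 -nn 'tcp port 9000' -G 3600 -w '/var/tmp/agent-tcp-%H.pcap'
```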
