Elastic Agent: UDP packet processing limits

I am using the Elastic Agent with the CEF integration, which receives and processes UDP data on a port. In the metrics for the CEF integration, I see the value `filebeat_input.system_packet_drops`.
The integration discards UDP packets every minute. A fairly large volume of UDP traffic arrives on that port, yet CPU, network, and RAM are all underutilized.

Does the Elastic Agent have an internal limitation/bottleneck that only allows a certain amount of data to be processed, even if the system on which the agent is running is not actually fully utilized?
Is it necessary to distribute the traffic across multiple integrations/ports where possible?

Yes, there is the Agent's internal queue, which is a memory queue. Its size depends on the output configuration. What is your output? Elasticsearch? You can change the size of the queue by switching to one of the pre-defined output presets or by using a custom configuration; more information can be found here.
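For reference, a custom memory-queue configuration looks roughly like this when pasted into the output's advanced YAML settings in Fleet. The setting names follow the Beats memory-queue documentation; the numbers are purely illustrative, not recommendations for your load:

```yaml
# Custom memory-queue settings for the output (sketch -- verify the
# option names against the docs for your Agent version).
queue.mem:
  events: 12800            # total events the queue can hold
  flush.min_events: 1600   # preferred batch size handed to the output
  flush.timeout: 10s       # flush a partial batch after this long
```

Recent versions also expose pre-defined performance presets for the Elasticsearch output (e.g. a throughput-oriented preset) that adjust the queue among other things, so check which mechanism your Agent version supports before hand-tuning.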

Once the queue is full, the input stops accepting new events until the current ones are processed. This does not depend only on the specs of the host running the Agent; the host can be completely underutilized if the destination cannot keep up with the ingestion rate.

So, if your output is Elasticsearch and it cannot index data as fast as the events arrive, it applies backpressure to the Agent, which in turn tells the input to slow down. In the case of the UDP input, that means events are dropped.
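To see whether the drops happen at the OS socket before Filebeat ever reads the datagrams, you can compare the Agent metric with the kernel's own UDP counters. A minimal sketch, assuming a Linux host (`/proc/net/snmp` is where the kernel keeps these; `RcvbufErrors` is the counter that tracks receive-buffer overflows):

```python
# Illustrative sketch (not part of Elastic Agent): read the OS-level UDP
# counters behind a metric like system_packet_drops.  On Linux they live
# in /proc/net/snmp; RcvbufErrors counts datagrams dropped because the
# socket receive buffer was full, i.e. the reader stopped draining it.
import os

def parse_udp_counters(snmp_text: str) -> dict:
    """Return the 'Udp:' header/value pairs from /proc/net/snmp text."""
    udp_lines = [l for l in snmp_text.splitlines() if l.startswith("Udp:")]
    header, values = udp_lines[0].split()[1:], udp_lines[1].split()[1:]
    return dict(zip(header, map(int, values)))

if __name__ == "__main__" and os.path.exists("/proc/net/snmp"):
    with open("/proc/net/snmp") as f:
        counters = parse_udp_counters(f.read())
    # InErrors/RcvbufErrors growing while CPU, RAM, and NIC look idle is
    # the signature of backpressure from the consumer, not host limits.
    print({k: v for k, v in counters.items()
           if k in ("InDatagrams", "InErrors", "RcvbufErrors")})
```

If `RcvbufErrors` rises in step with the dropped-packet metric, the socket buffer is overflowing while the input is throttled, which matches the backpressure explanation.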

To fix this you need to troubleshoot the entire ingestion flow, from the source to the destination.
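Part of that flow is the receiving socket itself. As an illustration, these are the tuning options documented for the Filebeat UDP input; whether the CEF integration exposes them in its advanced settings depends on your version, so treat this as a sketch:

```yaml
# Filebeat UDP input tuning -- illustrative values, verify the option
# names against the UDP input docs for your Filebeat/Agent version.
read_buffer: 16MiB       # requested kernel socket receive buffer
max_message_size: 20KiB  # upper bound for a single datagram
```

For a larger `read_buffer` to take effect, the kernel's ceiling (`net.core.rmem_max` on Linux) usually has to be raised as well.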


That's interesting, but I'm still concerned that increasing the queue size will only absorb peak loads. Currently, according to the metrics, thousands of UDP packets are being dropped every minute, all day long. Yes, the events are sent to Elasticsearch nodes; the Agent is managed by Fleet and can communicate with multiple data nodes. But even here, the data nodes are far from fully utilized in terms of CPU, heap, RAM, network, and IO.

How can I investigate or debug this further?

I think that looking into the logs of both the Elasticsearch nodes and the Elastic Agent would be a start.

Do you have anything in the logs of the Agent?

Also, what is the event rate?