I am using the Elastic Agent with the CEF integration, which receives and processes UDP data on a port. In the metrics for the CEF integration I can see the value filebeat_input.system_packet_drops.
The integration drops UDP packets every minute. A large amount of UDP traffic arrives there, yet the CPU, network, and RAM are all underutilized.
Does the Elastic Agent have an internal limitation/bottleneck that only allows a certain amount of data to be processed, even if the system on which the agent is running is not actually fully utilized?
Is it necessary to spread the traffic across multiple integrations/ports as much as possible?
Yes, there is the Agent's internal queue, which is a memory queue. The size of the queue depends on the output configuration; what is your output, Elasticsearch? You can change the size of the queue by switching to one of the pre-defined presets or by using a custom configuration; more information can be found here.
Once the queue is full, the input stops accepting new events until the current ones have been processed. This does not depend only on the specs of the host running the Agent; those can all be underutilized if the destination cannot keep up with the ingestion rate.
So, if your output is Elasticsearch and it cannot index data as fast as you receive events, it will apply backpressure to the Agent, which will then tell the input to slow down. In the case of the UDP input, this means events will be dropped.
To troubleshoot and fix this you need to look at the entire ingestion flow, from the source to the destination.
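If you go the custom route, a minimal sketch of what the queue and output tuning could look like in the output's advanced YAML configuration is below; the values are only illustrative and assume an Elasticsearch output, not a recommendation for your workload:

```yaml
# Illustrative values only - tune to your measured event rate and cluster capacity.
bulk_max_size: 1600              # events per bulk request sent to Elasticsearch
worker: 4                        # concurrent bulk requests per Elasticsearch host
queue.mem.events: 12800          # in-memory queue size; keep it >= worker * bulk_max_size
queue.mem.flush.min_events: 1600 # flush once this many events are queued
queue.mem.flush.timeout: 5s      # or after this long, whichever comes first
```

A larger queue only absorbs bursts; more workers and bigger batches are what raise the sustained rate the output can push to Elasticsearch.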
That's interesting, but I'm still concerned that increasing the queue size will only help with peak loads. Currently, according to the metrics, thousands of UDP packets are being dropped every minute, all day long. Yes, the events are sent to Elasticsearch nodes. The agent is managed by Fleet and can communicate with multiple data nodes. But even there, the data nodes are far from being fully utilized in terms of CPU, heap, RAM, network, and IO.
I can't see any clues in either the agent logs or the node logs. There are approximately 0.5 million documents and 0.7 million UDP drops per minute through this CEF integration, depending on the time of day.
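Taken together that is roughly 1.2 million events per minute hitting the port, i.e. about 20,000 events per second, of which only around 8,300 per second actually make it into Elasticsearch.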
What is the total expected volume of documents and packets per minute?
What is the size of the container or host the agent is running on?
What version of the stack and integration are you using?
What is your entire configuration, including the output settings?
What is the throughput you're getting?
What is the throughput you're expecting?
In the document that @leandrojmp shared, the settings are not just about queue length... there are also workers and batch size, which can have a significant effect on throughput.
Workers are basically threads, so you may need to tune them up.
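As a rough, purely illustrative back-of-the-envelope: sustained output rate ≈ worker × bulk_max_size × bulk requests completed per second per worker. If a bulk request takes around 500 ms, 4 workers with bulk_max_size: 1600 move on the order of 12,800 events/sec, which would not keep up with the ~20,000 events/sec arriving in this thread; more workers or larger batches (with a queue big enough to feed them) would be the first thing to try.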