Dropped UDP packets on Logstash Inputs (3 different UDP inputs)

Netdata Alert: 1m ipv4 udp receive buffer errors | 23650 errors

I have a cluster of three servers that are all part of an ELK (Elasticsearch, Logstash, Kibana) cluster receiving netflow/sflow/ipfix data. Everything appears to be working fine, and without Netdata you would assume it was working perfectly, but I'm seeing the issue described below.

I've spent most of my time over the last few days researching this and I'm not making any progress whatsoever. I've tried tuning things with sysctl with absolutely no effect. The same graph pattern continues relentlessly, with RcvBufErrors and InErrors peaking at about 700 events per second. Occasionally I'll see a spike or a dip while making changes in a controlled manner, but the same pattern always prevails with the same peak values.
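For reference, the counters behind those graphs can be watched directly from the shell; a minimal sketch, assuming the usual netflow/sflow/ipfix ports (2055/6343/4739), which may differ per setup:

# Kernel-wide UDP counters; InErrors and RcvbufErrors are the columns Netdata alerts on
watch -n 1 'cat /proc/net/snmp | grep -w Udp'

# Per-socket buffer sizes and drop counts (the "d" field in skmem) for the flow inputs
ss -ulmn '( sport = :2055 or sport = :6343 or sport = :4739 )'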

The values I've tried increasing with sysctl and their current values are:

net.core.rmem_default = 8388608
net.core.rmem_max = 33554432
net.core.wmem_default = 52428800
net.core.wmem_max = 134217728
net.ipv4.udp_early_demux = 0 (was 1)
net.ipv4.udp_mem = 764304	1019072	1528608
net.ipv4.udp_rmem_min = 18192
net.ipv4.udp_wmem_min = 8192
net.core.netdev_budget = 10000
net.core.netdev_max_backlog = 2000
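For completeness, each of these is applied at runtime with sysctl -w and would normally also be written to a file under /etc/sysctl.d/ to survive a reboot; a sketch using two of the values above (the file name is arbitrary):

# Apply immediately, then read the values back to confirm the kernel accepted them
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.netdev_max_backlog=2000
sysctl net.core.rmem_max net.core.netdev_max_backlog

# Persist across reboots
echo 'net.core.rmem_max = 33554432' | sudo tee -a /etc/sysctl.d/90-udp-tuning.conf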

Note that I'm also getting the "10min netdev budget ran outs | 5929 events" alert as well, but that is less of a concern; it's why I've increased net.core.netdev_budget and net.core.netdev_max_backlog as described above.

Since I'm using Elastiflow on top of Logstash, I've also tried raising the number of workers (from 4 to 8), the queue size (from 2048 to 4096) and the receive buffer (from 32 MB to 64 MB) for each of the Logstash inputs, but I'm not seeing any difference there either. I've given the Logstash restart plenty of time to pick up the new settings, but the issue remains the same, although the patterns on the graphs did change somewhat. I see more RAM being used by UDP and so on, but no change in the packet loss situation.
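For context, those three knobs map to the following options on the underlying Logstash UDP input plugin. This is only a sketch of a single input: with Elastiflow the values are normally set through its own configuration rather than by editing the pipeline directly, and the port and codec shown are just the usual NetFlow defaults.

input {
  udp {
    port                 => 2055        # usual NetFlow port; the sflow/ipfix inputs are analogous
    workers              => 8           # raised from 4
    queue_size           => 4096        # raised from 2048
    receive_buffer_bytes => 67108864    # 64 MB requested for SO_RCVBUF
    codec                => netflow
  }
}

One thing worth checking: a 64 MB receive buffer request is capped by the kernel at net.core.rmem_max, which is 32 MB in the list above, so that particular change may not be taking effect at all.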

Any ideas on how to find out what I need to change, and how to actually determine what those settings should be, would be appreciated.

Thanks for reading.

Edit the systemd service file for Logstash; it should be /etc/systemd/system/logstash.service. Change Nice=19 to Nice=0 and restart Logstash.
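For example, the same change can be made with a drop-in override instead of editing the generated unit in place (paths assume a standard package install):

sudo mkdir -p /etc/systemd/system/logstash.service.d
printf '[Service]\nNice=0\n' | sudo tee /etc/systemd/system/logstash.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart logstash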

At a CPU nice level of 19 Logstash is running at the lowest priority and just about any other process will bump it off the CPU. Changing it to a nice level of 0 (the default when no nice level is specified) should significantly increase the throughput of Logstash and reduce UDP packet loss.

Re-nicing a process to a higher priority only really helps if the machine doesn't have sufficient CPU resources and Logstash is contending with other processes to get onto the CPU.

I have a 56-core machine with CPU to burn; re-nicing the Logstash process made no difference.

The fact is that the UDP input for Logstash is not designed to scale well across multiple CPUs. There is a well-documented rate of diminishing returns, largely related to kernel buffer contention and the fact that Java doesn't pin the worker threads to specific cores. IMO, 4 workers is the sweet spot between throughput and operational overhead. In benchmarking the Logstash Netflow codec, there is almost no gain above 16 cores.

If I had a 56-core machine and needed to maximize throughput, I would run multiple instances of Logstash fronted by NGINX as a load balancer.
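As a rough sketch of what that could look like with the NGINX stream module (the instance count and ports are illustrative; each Logstash instance would listen on its own local port):

# nginx.conf -- spread incoming flow datagrams across several local Logstash instances
stream {
    upstream logstash_netflow {
        server 127.0.0.1:12055;
        server 127.0.0.1:12056;
        server 127.0.0.1:12057;
    }

    server {
        listen 2055 udp;
        proxy_pass logstash_netflow;
        proxy_responses 0;    # flow export is one-way, so don't wait for replies
    }
}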

Where are these well-documented rates of diminishing returns? That would be a useful reference at the moment.

We're going to adopt a "many small VMs" approach rather than a "few large physicals" approach - having a reference to explain why would be useful.

Thanks

You can see the results of the testing completed by the maintainer of the Netflow codec here.

