Logstash udp input queue_size

Hello There,

What exactly does the queue_size parameter of the UDP input do?

I have a Logstash server that is not able to consume all of the UDP events sent to it, so a lot of UDP events are being dropped.

The documentation says that queue_size is "the number of unprocessed UDP packets you can hold in memory before packets will start dropping".

I have gradually increased this parameter's value up to 1000000 and added more workers using the workers => parameter, but I did not see any improvement in Recv-Q:

[~] # netstat -c --udp -an | grep 12201
udp 268395328 0 :::12201 :::*
udp 268430136 0 :::12201 :::*
udp 268430136 0 :::12201 :::*
udp 268430136 0 :::12201 :::*
udp 268353192 0 :::12201 :::*
udp 268428304 0 :::12201 :::*
udp 268430136 0 :::12201 :::*
udp 268367848 0 :::12201 :::*
udp 268400824 0 :::12201 :::*
udp 268397160 0 :::12201 :::*

The UDP receive queue is almost always full and most of the events are still being dropped (the machine still has free memory).
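For reference, this is roughly the input section I am tuning (the worker count is just the value I am currently experimenting with, not a recommendation):

input {
  udp {
    port       => 12201      # the port shown in the netstat output above
    queue_size => 1000000    # unprocessed packets held in memory before dropping starts
    workers    => 4          # threads draining that in-memory queue
  }
}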

Any ideas?

Regards

Maybe the total throughput of your Logstash configuration is lower than the input rate? Logstash is only as fast as the slowest part of its configuration. Maybe your filters are slow? Maybe your outputs are slow?

You can find the slowest/busiest part of Logstash with:

top -p <pid> -H

This will show you something like:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8725 logstash  20   0 9242m 792m  14m S 22.9  0.6 328:18.40 >output
 8382 logstash  20   0 9242m 792m  14m S  4.7  0.6  58:40.69 <redis
 8416 logstash  20   0 9242m 792m  14m S  3.0  0.6  38:34.98 |worker
 8418 logstash  20   0 9242m 792m  14m S  3.0  0.6  38:26.75 |worker

In my case the output plugin is the slowest/busiest component. I suspect you will find a component using close to 100% CPU.

If you have a bottleneck, try increasing the number of threads (using the workers plugin parameter or the filterworkers command-line parameter) so your Logstash instance can process more events per second.
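For illustration only (the numbers are placeholders, and the elasticsearch output and localhost host are just examples since I don't know which output you use):

# command line: more filter worker threads
bin/logstash agent -f logstash.conf -w 8     # -w is the short form of --filterworkers

# plugin parameter: more output threads, for outputs that support it
output {
  elasticsearch {
    host    => "localhost"   # placeholder, use whichever output you actually have
    workers => 4             # common output option, runs several output threads in parallel
  }
}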

If that does not improve your throughput, you should really think about horizontal scaling: adding more Logstash instances. Using UDP will not really work any more in such a setup, so you may want to investigate a queueing middleware such as Redis.
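Sketching the idea (the host name and key below are placeholders): lightweight shipper instances receive the traffic and push raw events into Redis, and as many indexer instances as you need pull from that list, run the filters, and ship the results onwards:

# on the shipper instances
output {
  redis {
    host      => "redis.example.org"
    data_type => "list"
    key       => "logstash"
  }
}

# on the indexer instances
input {
  redis {
    host      => "redis.example.org"
    data_type => "list"
    key       => "logstash"
  }
}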

Hope this helps.

Thanks for the reply. Yes, I checked the workers as you suggested and found that the input workers are not doing much, but the output workers are consuming almost all of the CPU. I added more workers and more CPU, but Logstash wants more :slight_smile: I now have 8 cores on each of the two Logstash nodes and a lot of output workers; they are processing almost 15,000 events per second (they sit behind an F5 load balancer), but a lot of UDP packets are still being dropped. I will add 2 more nodes and see what happens.

What output(s) are you using, if I may ask? Maybe I can suggest some optimisations.

A shot in the dark: if you are using the Elasticsearch output, you may want to increase your Elasticsearch index refresh interval. The default is 1 second, which tends to slow down bulk indexing quite a lot. We run it at 5 seconds.
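For example, something along these lines (the index pattern is just a guess at your naming):

curl -XPUT 'http://localhost:9200/logstash-*/_settings' -d '{
  "index": { "refresh_interval": "5s" }
}'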

I am using the GELF (UDP) output. I also tried the regular UDP output but did not see any difference in CPU usage or in the number of dropped UDP packets.
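For completeness, the output section is essentially just this (the host is a placeholder for our Graylog2 endpoint):

output {
  gelf {
    host => "graylog.example.org"
    port => 12201
  }
}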

The funny thing is that Logstash is forwarding everything to Graylog2, which runs on 2 very lightweight nodes and has no problem at all processing the messages and sending them on to Elasticsearch.

Two Logstash nodes with 8 CPUs and 8 GB RAM (6 GB heap) are almost dying (780% CPU usage) when processing 15,000 messages per second, while two Graylog nodes with 2 CPUs and 4 GB RAM (1 GB heap) are able to process everything, send it on to Elasticsearch, and do not seem to be overloaded at all!

What Logstash version are you using? Version 1.5 has some significant performance improvements.

If you're concerned about message and packet loss, why are you using UDP? By definition, UDP is lossy.

I am using the latest version. I know that UDP is lossy and I can live with a certain amount of packet loss, but 50-60% packet loss is not normal, even with UDP.

Hi, is there any way to resolve this?

Thanks.

The only solution I could come up with was adding more resources while increasing the input, filter, and output workers/threads.
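For anyone finding this thread later, this is roughly the shape of the final setup with the suggestions above combined (all numbers and hosts are placeholders, tune them for your own hardware):

input {
  udp {
    port       => 12201
    queue_size => 1000000
    workers    => 4          # input threads draining the UDP queue
  }
}

# filters run with extra worker threads, e.g. started as:
#   bin/logstash agent -f logstash.conf -w 8

output {
  gelf {
    host    => "graylog.example.org"
    port    => 12201
    workers => 4             # output threads, where the output supports this option
  }
}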