What exactly does the queue_size parameter in the UDP input do?
I have a Logstash server which is not able to consume all of the UDP events that are sent to it, so a lot of UDP events are being dropped.
The documentation says that queue_size is "the number of unprocessed UDP packets you can hold in memory before packets will start dropping".
I have gradually increased this parameter's value up to 1000000 and added more workers using the workers => parameter, but I did not see any improvement in Recv-Q.
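For reference, here is a minimal sketch of the kind of UDP input I am tuning (the port, codec and exact values are just illustrative placeholders, not my real configuration):

input {
  udp {
    port       => 5514        # example port
    codec      => "json"      # assumed codec; depends on the event format
    workers    => 4           # threads reading from the UDP socket
    queue_size => 1000000     # packets held in memory before drops start
  }
}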
Maybe the total throughput of your Logstash configuration is slower than the input rate? Logstash is only as fast as the slowest part of the configuration. Maybe your filters are slow? Maybe your outputs are slow?
You can find out the slowest/busiest part of logstash with:
top -p <pid> -H
This will show you something like:
 PID  USER      PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+      COMMAND
 8725 logstash  20   0  9242m  792m  14m  S  22.9  0.6   328:18.40  >output
 8382 logstash  20   0  9242m  792m  14m  S   4.7  0.6    58:40.69  <redis
 8416 logstash  20   0  9242m  792m  14m  S   3.0  0.6    38:34.98  |worker
 8418 logstash  20   0  9242m  792m  14m  S   3.0  0.6    38:26.75  |worker
In my case the output plugins are the slowest/busiest component. I guess you will find a component using 100% CPU.
If you have a bottleneck, try increasing the number of threads (using the workers plugin parameter or the filterworkers command-line parameter), so your Logstash instance can process more events per second.
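As a rough sketch (the plugin, host and worker counts are placeholders, and the exact flag depends on your Logstash version):

# more threads for a single output plugin
output {
  elasticsearch {
    host    => "localhost"   # placeholder host
    workers => 4             # run several output threads instead of one
  }
}

# more filter worker threads for the whole instance
bin/logstash -f logstash.conf -w 4   # -w / --filterworkers on 1.x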
If that does not improve your throughput, you really should think about horizontal scaling: adding more Logstash instances. Using UDP will not really work anymore in such a setup; you may want to investigate a queueing middleware like Redis.
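To give an idea of what that looks like (the host name and key are just examples): a lightweight shipper instance receives the UDP traffic and only pushes to Redis, while one or more indexer instances pull from Redis and do the expensive filtering and outputs.

# shipper: UDP in, Redis out
output {
  redis {
    host      => "redis.example.com"   # example broker host
    data_type => "list"
    key       => "logstash"            # example list key
  }
}

# indexer(s): Redis in, filters and the real outputs here
input {
  redis {
    host      => "redis.example.com"
    data_type => "list"
    key       => "logstash"
  }
}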
Thanks for the reply. Yes, I have checked the workers as you mentioned and found out that the input workers are not doing much work, but the output workers are consuming almost all of the CPU. I added more workers and more CPU, but Logstash wants more. I now have 8 cores on each of my 2 Logstash nodes and a lot of output workers; they are processing almost 15,000 events per second (they are behind an F5 load balancer), but a lot of UDP packets are still dropped. I will add 2 more nodes and see what happens.
What output(s) are you using if I may ask? Maybe I can suggest some optimisations.
Shot in the dark: if you are using the ES output, you may want to increase your ES index refresh interval. The default is 1 sec, which tends to slow down bulk indexing quite a lot; we run it at 5 sec.
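For example, for an existing index (the index name and host are placeholders; the same setting can also go into an index template):

curl -XPUT 'http://localhost:9200/logstash-2015.01.01/_settings' -d '{
  "index": { "refresh_interval": "5s" }
}'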
I am using the GELF (UDP) output. I also tried the plain UDP output, but I did not see any difference in CPU usage and no change in the number of dropped UDP packets.
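For reference, a minimal GELF output along these lines (the host name is a placeholder and everything else is left at its defaults):

output {
  gelf {
    host => "graylog.example.com"   # example Graylog2 address
    port => 12201                   # default GELF UDP port
  }
}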
The funny thing is that Logstash is forwarding everything to Graylog2, which runs on 2 very lightweight nodes and has no problems at all processing the messages and sending them to Elasticsearch.
The 2 Logstash nodes with 8 CPUs and 8 GB RAM (6 GB heap) are almost dying (780% CPU usage) when processing 15,000 messages per second, while the 2 Graylog2 nodes with 2 CPUs and 4 GB RAM (1 GB heap) are able to process everything, send it on to Elasticsearch, and do not seem to be overloaded at all!
I am using the latest version. I know that UDP is lossy and I can live with a certain amount of packet loss, but 50-60% packet loss is not normal, even with UDP.