Data loss using UDP input plugin

Almost 80% of the data is lost when I use the UDP input plugin for netflow data. Below is my configuration file.
input {
  udp {
    queue_size => 50000
    port => 9993
    type => "netflow"
    workers => 4
    codec => netflow { versions => [5] }
  }
}

output {
  kafka {
    broker_list => "172.17.33.17:9092"
    topic_id => "storm"
    producer_type => "async"
    batch_num_messages => 50000
    queue_buffering_max_messages => 50000
    queue_buffering_max_ms => 50
    queue_enqueue_timeout_ms => -1
    workers => 5
  }
}

Not sure where I am going wrong. Previously I was using the default values for all of the plugin properties; after increasing some of the buffer sizes I saw some improvement.

Logstash is running on a 4-core machine and all 4 cores are showing 80-90% usage.

Is the Kafka output slowing things down and causing packets to be dropped from the buffers, or is the UDP input configuration wrong? I am using Logstash 1.5.0.

It's UDP, so delivery is not guaranteed.
Are you sure it's all reaching LS?
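One quick sanity check on the Logstash host (just a sketch; counter names differ slightly between kernel versions) is the kernel's UDP statistics, since receive errors there mean datagrams were dropped before Logstash ever saw them:

# growing "packet receive errors" / "receive buffer errors" counters
# point to drops in the kernel's UDP socket buffers
netstat -su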

Thanks warkolm. We ran tcpdump to capture the UDP traffic on that machine, and when we compared the tcpdump-collected data with the Logstash-collected data we found this data loss. We used the Logstash file output for this test. Is there any other way to identify where exactly we are losing the data?

Does it happen if you just use a basic UDP input and a simple file output?
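For instance, something as stripped down as this (the file path below is just a placeholder for the test):

input {
  udp {
    port => 9993
    codec => netflow { versions => [5] }
  }
}

output {
  file { path => "/tmp/netflow-test.log" }
}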

It's the same with both the file output and the Kafka output.

What sort of throughput are you trying to process?

According to tcpdump: around 400,000 (4 lakh) UDP packets per minute, containing about 14 million netflow records per minute.

That is around 240k messages per second (14 million / 60 ≈ 233,000). That sounds like a lot for a single Logstash instance to handle. If you are only successfully capturing 20% of these events, you will likely need to spread the load across a larger number of Logstash instances.

Yep Christian, I thought the same, but we are currently listening on a single port for the UDP traffic.
How can I share it between 2 different Logstash instances?

I am not sure you can have multiple Logstash instances listening to the same port on a single host, but even if you could you might be limited by the resources of the server. What does resource usage look like on the host when you are collecting traffic? Is there anything limiting throughput, e.g. CPU?

You might be able to scale out to multiple instances by using a load balancer that can handle UDP, or possibly even by setting up DNS round robin.

I have 4 UDP input workers and that machine has 4 cores; across all 4 CPUs, 350-360% of CPU is being used.

If it uses that amount of CPU for processing 20% of the traffic, you will need to get a host with more CPU (as that seems to be the limiting factor) or scale out.

Christian, don't you think Kafka is taking time and we might be losing our data there?

It is quite possible that the Kafka output plugin is limiting throughput to some extent, but I am not sure exchanging it for some other output plugin would improve performance. Given the gap between the current throughput level and what is required, you will need to scale up and/or out.

You can test the throughput of the Kafka plugin by running a generator input (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-generator.html) with a dots codec (https://www.elastic.co/guide/en/logstash/current/plugins-codecs-dots.html).

That'll give you an idea of your capacity.
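For example, a benchmark pipeline could look roughly like this (a sketch only: the generator and dots codec usage follows the docs linked above, the broker and topic values are copied from your existing output, and the stdout/pv trick is the same one shown further down in this thread):

input {
  # synthetic events, taking the UDP/netflow input out of the equation
  generator {
    count => 0    # 0 = generate events indefinitely
  }
}

output {
  kafka {
    broker_list => "172.17.33.17:9092"
    topic_id => "storm"
  }
  # one dot per event on stdout, so the event rate can be measured with pv
  stdout { codec => dots }
}

Run it the same way as the stdout-only test below (bin/logstash -f <config file> | pv -Wr > /dev/null) to get an events-per-second figure with the Kafka output in the pipeline.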

Sorry for the wrong data; we are actually receiving 20K UDP packets per minute. So is one instance of Logstash capable of parsing that?

With the above configuration I am able to capture 90% of the data for the first 5 minutes; after that the data loss starts. Why this inconsistency?

Set your Kafka workers to 1 and see if that helps. You aren't going to get any more performance by having it larger than 1, due to the way Logstash parallelism works. Using async mode will definitely drop messages if the buffer is slow. I'd also set queue_buffering_max_ms much higher, like 5000; a value as low as 50 is going to chew through CPU and could affect your throughput (too many small batches going out). Set batch_num_messages to roughly 1/50th of your max, so 1000, to balance out the higher queue_buffering_max_ms.
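Concretely, that would look something like this (a sketch of your original output block with only the values discussed above changed):

output {
  kafka {
    broker_list => "172.17.33.17:9092"
    topic_id => "storm"
    producer_type => "async"
    queue_buffering_max_messages => 50000
    queue_buffering_max_ms => 5000    # fewer, larger batches
    batch_num_messages => 1000        # ~1/50th of queue_buffering_max_messages
    queue_enqueue_timeout_ms => -1
    workers => 1                      # single output worker
  }
}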

Try that out and let us know!

Thanks Joe, I tried your suggestion; still the same. It captures 100% of the data for the first 6 minutes and then falls back to capturing 20%.

That definitely sounds like a bottleneck somewhere.

Try benchmarking with the dots codec just hitting stdout.
output {
  stdout { codec => dots }
}

$ bin/logstash -f test.conf | pv -Wr > /dev/null

That'll tell you if Logstash has the throughput.
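(pv -W waits for the first byte before it starts reporting and -r shows the current transfer rate; since the dots codec emits one byte per event, the rate shown is effectively events per second.)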