UDP/Netflow Performance

Hi there,

Can anyone share their experience with server scaling and UDP performance?

My Logstash setup is quite simple:
UDP input with the Netflow codec -> no filters -> Elasticsearch output via the HTTP bulk API

My current test environment is based on a quite outdated 4-core Xeon (E5320) with 16GB RAM and 10k SAS drives in a RAID 1 configuration.

I'm collecting about 2k flows per second from one of our edge routers, which causes an average system load of 4.0 during daily peaks.
Nearly all of the CPU load is caused by the Java process Logstash runs in; Elasticsearch only utilizes half a core on average.

I'm wondering if this is normal behavior. I've read that Logstash itself is capable of processing 100k events per second, so I'm surprised that 2% of that causes such a high load.
Is JSON serialization for the ES output causing this high load?

Our production environment currently produces daily peaks of 10k flows/sec, and my production hardware should be able to process 20k flows/sec. What would a suitable server configuration look like?

Thanks in advance for all replies

best regards

What does your input block look like? Also, your output block? Which version of Logstash are you using?

This seems quite low, as I'm able to get 30k events per second with the UDP input plugin as it comes "out of the box." This is with Logstash 1.5, btw.
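If you want to establish a raw-throughput baseline yourself, a minimal config like the one below takes the Netflow codec and Elasticsearch out of the picture entirely (the port number is just an example). The dots codec writes one character per event to stdout, so you can estimate events per second by piping Logstash's output through a rate-measuring tool such as pv:

input {
    udp {
        port => 9995
    }
}
output {
    stdout {
        codec => dots
    }
}

If the event rate is high with this config but drops once you re-enable the Netflow codec or the Elasticsearch output, that tells you which stage is the bottleneck.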


As already mentioned, my setup is quite basic.

This is the whole configuration:

input {
    udp {
        port => 9995
        codec => netflow {
            versions => [5, 9]
        }
        workers => 4
        type => "netflow"
    }
}

filter {
}

output {
    elasticsearch {
        host => "localhost"
        cluster => "logstash"
        protocol => "http"
        flush_size => 4000
        index => "netflow-%{+YYYY.MM.dd}"
    }
}

I tried several worker thread settings in the input section and experimented with some bulk API settings, without any significant effect.

Currently I'm using Logstash 1.5.0.

Thanks for adding the information. This makes things clearer.

  1. flush_size => 4000 is inappropriate with Logstash 1.5, and is probably the bottleneck here. Since 1.5 was released, the output retries any messages which failed to be indexed via the bulk API. 500 is a more appropriate number; the default is 1000. We're in the midst of doing some performance testing with the new retry logic, and I believe it will shake out somewhere near 500. In older versions (where no retry logic existed) it was fire and forget: if Elasticsearch failed to index, you were out of luck.
  2. While workers => 4 may make sense for your udp input, the default of 2 should suffice, and you likely won't need more than 3. The queue_size directive (default 2000) may help if you're not ingesting fast enough. It sounds more like the output is the blocker, though.
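For what it's worth, here is a sketch of your config with both suggestions applied. The queue_size value of 10000 is only an illustrative starting point, not a tested recommendation — tune it based on whether you see drops at the input:

input {
    udp {
        port => 9995
        workers => 2
        queue_size => 10000
        codec => netflow {
            versions => [5, 9]
        }
        type => "netflow"
    }
}

output {
    elasticsearch {
        host => "localhost"
        cluster => "logstash"
        protocol => "http"
        flush_size => 500
        index => "netflow-%{+YYYY.MM.dd}"
    }
}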