Only getting 900 Netflow events/sec from a 4 vCPU LS node

Hi,

I'm using the Netflow module [ELK 6.2.4] and struggling to squeeze a decent number of flows/events out of my LS node. Hoping someone here can spot what I may be missing. My (virtualised) setup:

1 x LS VM node: 8GB RAM (4GB heap), 4 vCPU

3 x ES VM nodes, all master, all data, each: 16GB RAM (8GB heap), 2 vCPU, 250GB SSD

I have replicated this in a test environment and am using the Flowalyzer application to generate test flow data.

According to previous Netflow posts from @rcowart, I should expect to see roughly the throughput below.

vCPU    UDP workers    flows/sec
1       1              2,300
2       2              4,300
4       4              6,700
8       8              9,100
16      16             15,000
32      32             16,000

However, no matter what I try, I cannot get more than ~900 events (i.e. flows) per second out of this LS node.

Pumping the test flow data up to about 5,000 flows/second still only gives me ~900 events/second in LS, with CPU utilization of ~75%. I've attached screenshots of my LS and ES overviews below for more detail.

Things I've tried:

  1. Doubling the number of vCPUs on the LS node to 8. This produced no discernible increase in events.

  2. Increasing the batch size to 250 and workers to 8 (while keeping 4 vCPU). This had no effect.

  3. Creating another LS node, putting an NGiNX load balancer in front and sending to both LS nodes. This increased events by about 50% (it did not double the number of events as I expected). However, I don't think this solves whatever my underlying problem is.

Some other info:

  • My LS indexing latency is about 4ms, which seems high.

  • top shows the CPUs are all getting equal load; however, /proc/interrupts shows one CPU handling most of the interrupts.

  • netstat shows A LOT of packet receive errors (I'm checking the counters as shown below this list). Presumably this is a symptom of my LS not being able to process what I'm sending.

  • My Elasticsearch doesn't seem strained, so I don't think there is any backup of events, but I may be missing something. My ES indexing latency hovers consistently around 0.3ms, which seems pretty good, right?
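For reference, this is how I'm checking those UDP error counters; the "packet receive errors" line under the Udp stats is the one that keeps climbing:

# dump UDP statistics, including "packet receive errors" and "receive buffer errors"
netstat -su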

I appreciate the obvious answer here is to do more horizontal scaling, but given the suggested throughput above for 4 vCPU (6,700 flows/second), I should be able to get a lot more out of my current node.

If the above stats are right and even one vCPU can produce 1,000+ flows/sec, then I'm massively underpowered somewhere.

My production environment is likely to be around 10,000-15,000 flows/second, so I want to be sure what setup I'm going to need before we delve into that.

Interested in any thoughts people have on this, thanks.

15-minute overviews (LS and ES monitoring screenshots):

First off... don't use the Netflow module... use ElastiFlow.

The Netflow Module was based on ElastiFlow 1.0.0 and is waaaaaay behind ElastiFlow, which is at 2.1.0 and extracts a lot more value from the data. I am hoping to release ElastiFlow 2.5.0 within the next 2 weeks, which will include efficiency improvements that will save 20-30% disk space, as well as other improvements and new features. Depending on your use-cases, especially if they are security-related, there is even more that is possible. PM me if you want more details.

Now to the performance issues... those numbers are provided by the developer of the Netflow codec. They aren't mine. They are achievable, but require some attention to a number of things. Also note that those numbers are only for the decoding of the flows. They don't include all of the parsing, formatting and enrichment of the data. This is true whether we are talking about the Netflow Module, ElastiFlow or the more advanced solutions we provide. Given this additional load, it is unlikely that you will be able to ingest 5000 flows/sec with only 4 vCPUs. 3000/sec is a more realistic goal, but will be very dependent on your sources.

I have spent A LOT of time figuring out how to get the most performance from the UDP input, which is critical for syslog and especially for flow (Netflow, IPFIX, sFlow) use-cases. I am planning a more complete article on this topic, explaining the "why" behind the following. For now, just trust me and try this...

  1. Run these commands and add them to a file under /etc/sysctl.d so they are applied at each boot (a sample file is sketched after this list):
sudo sysctl -w net.core.somaxconn=2048
sudo sysctl -w net.core.netdev_max_backlog=2048
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.rmem_default=262144
sudo sysctl -w net.ipv4.udp_rmem_min=16384
sudo sysctl -w net.ipv4.udp_mem="2097152 4194304 8388608"
  2. Add these options to your UDP input (see the sketch after this list):
workers => 4 (or however many cores/vCPUs you have)
queue_size => 16384
  3. In your logstash.yml (or pipelines.yml if that is what you are using) use these settings:
pipeline.batch.size: 512
pipeline.batch.delay: 250
  4. In the startup.options file change LS_NICE to 0 and re-run system-install:
# Nice level
LS_NICE=0
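
To make steps 1 and 2 concrete, here is a minimal sketch. The file name, port and interface are just examples, so adjust them for your environment. Step 1 as a persistent file:

# /etc/sysctl.d/90-logstash-udp.conf (example file name)
net.core.somaxconn = 2048
net.core.netdev_max_backlog = 2048
net.core.rmem_max = 33554432
net.core.rmem_default = 262144
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_mem = 2097152 4194304 8388608

And step 2 as a plain UDP input with the netflow codec (if you stay on the Netflow module, look for its corresponding var.input.udp.* settings instead):

input {
  udp {
    port => 2055          # example port; use whatever your exporters actually send to
    workers => 4          # match your vCPU count, per step 2
    queue_size => 16384
    codec => netflow
  }
}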

After making these changes you can (re)start Logstash. You should see a significant boost in throughput.

The other thing to look at is a PCAP of the actual incoming flows. As a device monitors network flows and creates flow records, it will assemble the records into a "flowset" to send to the collector. To improve the efficiency of sending this data, multiple flow records are packed into a single flowset, which is sent within a single UDP packet. You can see this in Wireshark...

However, some devices bundle only a single flow in a flowset...

In this case the second device will have to send 8x the number of packets to transfer the same number of flow records as the first device. Those extra packets must be received by the network interface (more interrupts), buffered through the kernel and passed through the associated buffer locks, and the PDU decoded by the UDP input. Only then can the codec decode the flows themselves. So there is A LOT more overhead on the collector when the device sends only a single flow record per packet.

If your devices behave in this way, you need to discuss it with the vendor and see what options there may be, but don't expect much help. Most likely you will need to plan to deploy additional Logstash instances to deal with this extra packet load.
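If you want to see how your devices pack flow records, a short capture like the one below will give you a PCAP to open in Wireshark (2055 is just a common collector port and eth0 an example interface; substitute your own):

sudo tcpdump -i eth0 -c 1000 -w netflow-sample.pcap udp port 2055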

As you can see there is more to high-volume ingest than initially meets the eye. Besides all of the value that our turnkey solutions provide, this is one of the reasons our customers engage us... our experience building out high-scale ingest architectures. You will find Elasticsearch is pretty simple, as is Kibana. The challenge is always the high-scale ingestion of high-quality data.

Let me comment on a few of your other points before I close...

Doubling the number of vCPUs on the LS node to 8. This produced no discernible increase in events.

Increasing the batch size to 250 and workers to 8 (while keeping 4 vCPU). This had no effect.

Both of these are related to the fact that the pipeline.workers setting is for filters and outputs, not inputs. Adding the workers setting within the UDP input is necessary to engage multiple processors. The default is only 2.
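In other words, these are two different knobs (the values and port here are just for illustration):

# logstash.yml - pipeline.workers only controls the filter/output worker threads
pipeline.workers: 4

# the UDP input's own workers setting controls the socket reader threads (default 2)
input {
  udp {
    port => 2055        # example port
    workers => 4
    codec => netflow
  }
}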

Creating another LS node, putting an NGiNX load balancer in front and sending to both LS nodes. This increased events by about 50% (it did not double the number of events as I expected).

This is because the NGiNX host/VM was likely suffering from the same default Linux tuning that affected the Logstash VM. You would need the same kernel parameter changes mentioned in #1 above on the NGiNX host/VM as well.
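For what it's worth, UDP load balancing in NGiNX is done with the stream module; a minimal sketch looks something like this (the addresses and port are placeholders):

# nginx stream block (addresses and port are examples)
stream {
    upstream logstash_netflow {
        server 192.0.2.11:2055;
        server 192.0.2.12:2055;
    }
    server {
        listen 2055 udp;
        proxy_pass logstash_netflow;
        proxy_responses 0;    # flow exporters don't expect replies, so don't wait for any
    }
}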

top shows the CPUs are all getting equal load; however, /proc/interrupts shows one CPU handling most of the interrupts.

This is because a single kernel thread binds to a single NIC receive queue. If you have a server-class NIC with multiple receive queues, you can spread the load across multiple cores, but only if you have multiple listeners, which would mean running each listener on its own port. Of course you are in a VM, and the hypervisor gives you only a single receive queue per vNIC, so this isn't an option even if it could be easily done in Logstash. This is why adding CPUs (and workers) doesn't scale as expected. The threads must all compete for the single lock which governs access to the receive buffer, and more application threads means more lock contention and quickly diminishing returns.
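You can confirm the receive-queue and interrupt layout on your own VM with something like the commands below (eth0 is an example; on virtio NICs the interrupt lines may be named virtio*-input instead):

# how many receive queues the (v)NIC exposes
ethtool -l eth0

# which CPU is servicing the NIC's interrupts
grep eth0 /proc/interrupts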

netstat shows A LOT of packet receive errors. Presumably this is a symptom of my LS not being able to process what I'm sending.

Correct! This is one reason why using something like Redis and a multi-tier Logstash architecture is a good idea. The data can be quickly pulled from the input buffers and stuffed into Redis (or Kafka), from which one or more processing instances can grab it and complete the parsing, formatting and enrichment of the data. This is a standard feature of our Synesis solution, which allows you to work with flows, access and security logs in a seamlessly integrated user experience.
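As a rough sketch of what that two-tier layout looks like in Logstash terms (the host name, key and port are placeholders, and the tier-2 filters and Elasticsearch output are omitted):

# Tier 1: collector - receive and decode flows, then hand off immediately
input {
  udp {
    port => 2055
    workers => 4
    queue_size => 16384
    codec => netflow
  }
}
output {
  redis {
    host => "redis.example.internal"
    data_type => "list"
    key => "netflow"
  }
}

# Tier 2: processor(s) - pull from Redis, do the parsing/formatting/enrichment, send to Elasticsearch
input {
  redis {
    host => "redis.example.internal"
    data_type => "list"
    key => "netflow"
  }
}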

3 x ES VM nodes, all master, all data, each: 16GB RAM (8GB heap), 2 vCPU, 250GB SSD

You should realize that you will need A LOT more disk space if you are planning to collect 15K flows/sec.

500 bytes x 15,000 flows/sec x 86,400 sec/day ≈ 650GB per day (and you don't have any replicas yet).
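Add a replica and a retention window and it grows quickly. For example, still assuming ~500 bytes/flow (which will vary with your enrichment) and a 30-day retention:

650GB x 2 (1 replica) ≈ 1.3TB per day
1.3TB x 30 days ≈ 39TB total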

Hopefully this all helps. Please report back the results of the above changes. I would like to know how much they help.

Rob

Robert Cowart (rob@koiossian.com)
www.koiossian.com
True Turnkey SOLUTIONS for the Elastic Stack


Fantastic response Rob, thank you.

I'll digest everything you've said, try the amendments and report back. I'll likely end up tearing down my dev environment and giving ElastiFlow a shot on a fresh install.

I'll also PM you regarding the security aspects of ElastiFlow, as I'm interested to know more.

And yes, it's starting to dawn on me just how much data we're potentially going to be pulling in. Will be rethinking that 30-day retention period!

Thanks again for the detailed response, much appreciated
