First off... don't use the Netflow module... use ElastiFlow.
The Netflow Module was based on ElastiFlow 1.0.0 and is waaaaaay behind ElastiFlow, which is at 2.1.0 and extracts a lot more value from the data. I am hoping to release ElastiFlow 2.5.0 within the next 2 weeks, which will include efficiency improvements that will save 20-30% disk space, as well as other improvements and new features. Depending on your use-cases, especially if they are security-related, there is even more that is possible. PM me if you want more details.
Now to the performance issues... those numbers are provided by the developer of the Netflow codec. They aren't mine. They are achievable, but require some attention to a number of things. Also note that those numbers are only for the decoding of the flows. It doesn't include all of the parsing, formatting and enrichment of the data. This is true whether we are talking about the Netflow Module, ElastiFlow or the more advanced solutions we provide. Given this additional load, it is unlikely that you will be able to ingest 5000 flows/sec with only 4 vCPUs. 3000/sec is a more realistic goal, but will be very dependent on your sources.
I have spent A LOT of time figuring out how to get the most performance from the UDP input, which is critical for syslog and especially for flow (Netflow, IPFIX, sFlow) use-cases. I am planning a more complete article on this topic, explaining the "why" behind the following. For now, just trust me and try this...
- Run these commands now, and also add them to a file under /etc/sysctl.d so they persist across reboots:
sudo sysctl -w net.core.somaxconn=2048
sudo sysctl -w net.core.netdev_max_backlog=2048
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.rmem_default=262144
sudo sysctl -w net.ipv4.udp_rmem_min=16384
sudo sysctl -w net.ipv4.udp_mem="2097152 4194304 8388608"
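For the persistent version, the same values can go into a drop-in file (the filename here is just an example; anything ending in .conf under /etc/sysctl.d works):

```
# /etc/sysctl.d/90-udp-tuning.conf  (example filename)
net.core.somaxconn=2048
net.core.netdev_max_backlog=2048
net.core.rmem_max=33554432
net.core.rmem_default=262144
net.ipv4.udp_rmem_min=16384
net.ipv4.udp_mem=2097152 4194304 8388608
```

You can load it immediately with `sudo sysctl --system`, which re-reads all of the sysctl.d drop-ins.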
- Add these options to your UDP input:
workers => 4 (or however many cores/vCPUs you have)
queue_size => 16384
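Putting those options together, a UDP input for flow collection might look like the sketch below. The port and codec are assumptions for illustration; adjust workers to your actual vCPU count:

```
input {
  udp {
    port       => 2055       # example port; use whatever your exporters send to
    workers    => 4          # match your core/vCPU count
    queue_size => 16384
    receive_buffer_bytes => 33554432   # in line with net.core.rmem_max above
    codec      => netflow
  }
}
```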
- In your logstash.yml (or pipelines.yml if that is what you are using), check that pipeline settings such as pipeline.workers and pipeline.batch.size match your hardware.
- In the startup.options file, change LS_NICE (under the "# Nice level" comment) to 0, then re-run system-install so the new value is picked up.
After making these changes you can (re)start Logstash. You should see a significant boost in throughput.
The other thing to look at is a PCAP of the actual incoming flows. As a device monitors network flows and creates flow records, it will assemble the records into a "flowset" to send to the collector. To improve the efficiency of sending this data, multiple flow records are packed into a single flowset, which is sent within a single UDP packet. You can see this in Wireshark...
However, some devices bundle only a single flow in a flowset...
In this case the second device will have to send 8x the number of packets to transfer the same number of flow records as the first device. Those extra packets must be received by the network interface (more interrupts), buffered through the kernel and passed through the associated buffer locks, and the PDU decoded by the UDP input. Only then can the codec decode the flows themselves. So there is A LOT more overhead on the collector when the device sends only a single flow record per packet.
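To put rough numbers on that overhead, here is the back-of-envelope packet math (the 2,000 flows/sec export rate is just an illustrative assumption):

```shell
#!/bin/sh
# Packets/sec needed to carry the same flow volume at different packing rates
flows_per_sec=2000                    # assumed export rate per device
echo $(( flows_per_sec / 8 ))         # 8 flows per packet -> 250 packets/sec
echo $(( flows_per_sec / 1 ))         # 1 flow per packet  -> 2000 packets/sec
```

Every one of those extra 1,750 packets per second pays the full interrupt/buffer/decode cost described above.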
If your devices behave in this way, you need to discuss it with the vendor and see what options there may be, but don't expect much help. Most likely you will need to plan to deploy additional Logstash instances to deal with this extra packet load.
As you can see, there is more to high-volume ingest than initially meets the eye. Besides all of the value that our turnkey solutions provide, this is one of the reasons our customers engage us... our experience building out high-scale ingest architectures. You will find Elasticsearch is pretty simple, as is Kibana. The challenge is always the high-scale ingestion of high-quality data.
Let me comment on a few of your other points before I close...
Doubling the number of vCPUs on the LS node to 8. This had no discernible increase in events.
Increasing batch size to 250 and workers to 8 (while keeping 4vCPU). Had no effect.
Both of these are related to the fact that the pipeline.workers setting is for filters and outputs, not inputs. Adding the workers setting within the UDP input is necessary to engage multiple processors. The default is only 2.
Creating another LS node, putting an nginx load balancer in front and sending to two LS nodes. This increased events by about 50% (it did not double the number of events as I expected).
This is because the NGINX host/VM was likely suffering from the same default Linux tuning that affected the Logstash VM. You would need to apply the same kernel parameter changes mentioned above to the NGINX host/VM as well.
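For reference, UDP load balancing in NGINX is done with the stream module; a minimal sketch might look like this (the hostnames and port are assumptions):

```
stream {
  upstream logstash_udp {
    server ls1.example.com:2055;   # assumed Logstash node addresses
    server ls2.example.com:2055;
  }
  server {
    listen 2055 udp;
    proxy_pass logstash_udp;
    proxy_responses 0;   # flow exports are one-way; don't wait for replies
  }
}
```

Remember that this host receives every packet before the Logstash nodes do, so the kernel receive-buffer tuning matters here just as much.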
top shows CPUs are all getting equal load, however /proc/interrupts shows one CPU handling a lot of the interrupts.
This is because a single kernel thread binds to a single NIC receive queue. If you have a server-class NIC with multiple receive queues, you can spread the load across multiple cores, but only if you have multiple listeners, which would mean using a separate port for each listener. Of course you are in a VM, and the hypervisor gives you only a single receive queue per vNIC, so this isn't an option even if it could easily be done in Logstash. This is why adding CPUs (and workers) doesn't scale as expected. The threads must all compete for the single lock which governs access to the receive buffer. More application threads means more lock contention, and quickly diminishing returns.
netstat shows A LOT of packet receive errors. Presumably this is a symptom of my LS not being able to process what I'm sending.
Correct! This is one reason why using something like Redis and a multi-tier Logstash architecture is a good idea. The data can be quickly pulled from the input buffers and stuffed into Redis (or Kafka), from which one or more processing instances can grab it and complete the parsing, formatting and enrichment of the data. This is a standard feature of our Synesis solution, which allows you to work with flows, access and security logs in a seamlessly integrated user experience.
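As a sketch of that two-tier idea (the Redis host and key name are assumptions): the collector tier does nothing but decode UDP packets and push them to Redis, while the processing tier pulls from Redis and does the heavy filtering at its own pace.

```
# Tier 1 - collector pipeline: drain the UDP buffers as fast as possible
input {
  udp {
    port       => 2055
    workers    => 4
    queue_size => 16384
    codec      => netflow
  }
}
output {
  redis {
    host      => "redis.example.com"   # assumed Redis host
    data_type => "list"
    key       => "netflow"             # assumed key name
  }
}

# Tier 2 - processing pipeline: parse, format and enrich, then index
input {
  redis {
    host      => "redis.example.com"
    data_type => "list"
    key       => "netflow"
  }
}
```

If the processing tier falls behind, events queue in Redis instead of being dropped at the NIC.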
3 x ES VM Nodes, all Master, all data, each with 16GB RAM (8GB heap); 2 CPU; 250GB SSD
You should realize that you will need A LOT more disk space if you are planning to collect 15K flows/sec.
500 bytes x 15,000 flows/sec x 86,400 secs/day ≈ 648GB per day. (and you don't have any replicas yet)
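That estimate as simple shell arithmetic, in case you want to plug in your own numbers:

```shell
#!/bin/sh
# bytes/flow * flows/sec * secs/day
bytes_per_day=$(( 500 * 15000 * 86400 ))
echo "$bytes_per_day"                     # 648000000000 bytes
echo $(( bytes_per_day / 1000000000 ))    # ~648 GB/day, before replicas
```

Double that for a single replica, and add headroom for indexing overhead and merges.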
Hopefully this all helps. Please report back the results of the above changes. I would like to know how much they help.
Robert Cowart (email@example.com)
True Turnkey SOLUTIONS for the Elastic Stack