Hi folks,
we are importing flow data into our 10-node Elasticsearch cluster via the Filebeat netflow input. The stack is running on 7.14.0. Unfortunately I am seeing performance problems, and after all the debugging and analysis I've done, I've come to the conclusion that the Filebeat instance handling the incoming netflow traffic must be the bottleneck.
I've read several times that the Elasticsearch output is usually the bottleneck rather than Filebeat itself, but I see a few indicators that point to Filebeat.
Filebeat is running on an 8-core VM with 16 GB of memory, a 10G link and a fast PCIe NVMe SSD; it is currently able to index about 22,000 flows per second. I estimate that the real flow rate hitting the input is around 30,000 flows per second, and we plan to increase that to about 70,000 flows per second. Here's the filebeat config:
filebeat.modules:
- module: netflow
  log:
    enabled: true
    var:
      netflow_host: 0.0.0.0
      netflow_port: 2055
      queue_size: 50000
      read_buffer: 100MiB
      detect_sequence_reset: true
      tags: ["netflow", "forwarded"]

output.elasticsearch:
  hosts: ["elastic01.example.com:9200", "elastic02.example.com:9200", "elastic03.example.com:9200", "elastic04.example.com:9200", "elastic05.example.com:9200"]
  protocol: "https"
  ssl.verification_mode: none
  username: "beats"
  password: "password"
  indices:
    - index: "filebeat-%{[event.module]}"
      when.equals:
        event.module: "netflow"
  #pipeline: dns_geoip-info
  worker: 5
  bulk_max_size: 0
  compression_level: 0

queue:
  mem:
    events: 200000
    #flush.min_events: 4000
    flush.timeout: 2s

processors:
  - dns:
      type: reverse
      fields:
        observer.ip: netflow.exporter.hostname
      success_cache:
        capacity.initial: 1000
        capacity.max: 10000
        min_ttl: 4h

http.enabled: true
http.host: localhost
http.port: 5066
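One thing I'm not sure about is whether the 100MiB read_buffer above is actually honoured: as far as I know the kernel silently caps the UDP socket receive buffer at net.core.rmem_max, so if that sysctl is lower, the buffer stays small no matter what the config says. This is roughly how I'd check and raise it on the collector host (Linux, value just as an example):

# Show the current kernel cap for socket receive buffers (bytes)
sysctl net.core.rmem_max

# Raise it so a 100MiB read_buffer can actually be applied
# (value in bytes; put it in /etc/sysctl.d/ to survive reboots)
sudo sysctl -w net.core.rmem_max=104857600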
If I change the output to console and pipe it through pv, I can see that roughly the same number of events is emitted:
filebeat -c /home/user/test_filebeat.yml | pv -Warl > /dev/null
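Since the HTTP endpoint is enabled in the config above, I can also pull Filebeat's internal pipeline counters and compare them with the pv rate. Something like this (jq is just for readability; the exact metric names may differ slightly between versions):

# Snapshot of the libbeat pipeline and output counters from the monitoring endpoint
curl -s http://localhost:5066/stats | jq '.libbeat.pipeline.events, .libbeat.output.events'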
If I check the flows/events in Kibana, I can see a gap building up between the timestamps netflow.exporter.timestamp and event.created:
From what I understand, "event.created" is the timestamp at which Filebeat has finished processing the event. The difference between those two timestamps grows to about 10 minutes if Filebeat runs for a few hours. Beyond roughly 10 minutes the gap does not increase anymore; I guess that's the point where Filebeat starts dropping events.
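To put a number on that gap, a query like the sketch below could compute the lag directly in Elasticsearch with a runtime field (it assumes both fields are mapped as dates and present on every flow event, and uses the index name and credentials from the config above):

# Average and maximum delay between flow export and Filebeat processing, last 15 minutes
curl -sk -u beats:password "https://elastic01.example.com:9200/filebeat-netflow/_search" \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": { "range": { "event.created": { "gte": "now-15m" } } },
  "runtime_mappings": {
    "ingest_lag_ms": {
      "type": "long",
      "script": "emit(doc[\"event.created\"].value.toInstant().toEpochMilli() - doc[\"netflow.exporter.timestamp\"].value.toInstant().toEpochMilli())"
    }
  },
  "aggs": {
    "avg_lag_ms": { "avg": { "field": "ingest_lag_ms" } },
    "max_lag_ms": { "max": { "field": "ingest_lag_ms" } }
  }
}'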
I would expect to see dropped events in the stack monitoring of that Beat instance, but I don't. This is a screenshot of the collected metrics (last 30 minutes):
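My current suspicion is that, if Filebeat can't keep up, the flows get dropped at the kernel UDP socket before they ever reach the beat, which would explain why nothing shows up as dropped in stack monitoring. I'd check for that on the collector host roughly like this (Linux):

# Global UDP counters: growing "packet receive errors" / "receive buffer errors" means the kernel is dropping datagrams
netstat -su

# Per-socket view: a constantly full Recv-Q on port 2055 points in the same direction
ss -uln 'sport = :2055'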
At this point I'm out of ideas on what else I can try to increase throughput. I don't see any of the typical performance issues such as CPU saturation, lack of memory, I/O pressure, etc. So maybe some of you have an idea of what I could try next.