Performance issue with Filebeat and Netflow Input

Hi folks,

We are importing flow data into our 10-node Elasticsearch cluster via the Filebeat netflow input. The stack is running on 7.14.0. Unfortunately I'm seeing performance problems, and after all the debugging and analysis I've done I've come to the assumption that the Filebeat instance handling the incoming netflow traffic must be the bottleneck.
I've read multiple times that the Elasticsearch output is usually the bottleneck and not Filebeat, but I see a few indicators that point to Filebeat.

Our Filebeat is running on an 8-core VM with 16 GB of memory, a 10G link and a fast PCIe NVMe SSD. It's currently able to index 22,000 flows per second. I estimate that the real flow rate hitting the input is about 30,000 flows per second, and we plan to increase that to about 70,000 flows per second. Here's the Filebeat config:

filebeat.modules:
- module: netflow
  log:
    enabled: true
    var:
      netflow_host: 0.0.0.0
      netflow_port: 2055
      queue_size: 50000
      read_buffer: 100MiB
      detect_sequence_reset: true
      tags: ["netflow", "forwarded"]

output.elasticsearch:
  hosts: ["elastic01.example.com:9200", "elastic02.example.com:9200", "elastic03.example.com:9200", "elastic04.example.com:9200", "elastic05.example.com:9200" ]
  protocol: "https"
  ssl.verification_mode: none
  username: "beats"
  password: "password"
  indices:
    - index: "filebeat-%{[event.module]}"
      when.equals:
        event.module: "netflow"
  #pipeline: dns_geoip-info
  worker: 5
  bulk_max_size: 0
  compression_level: 0


queue:
  mem:
    events: 200000
    #flush.min_events: 4000
    flush.timeout: 2s

processors:
  - dns:
      type: reverse
      fields:
        observer.ip: netflow.exporter.hostname
      success_cache:
        capacity.initial: 1000
        capacity.max: 10000
        min_ttl: 4h

http.enabled: true
http.host: localhost
http.port: 5066

If I change the output to console and pipe it through pv, I see roughly the same event rate:

filebeat -c /home/user/test_filebeat.yml | pv -Warl > /dev/null
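As a cross-check for the pv measurement, the rate could also be derived from the HTTP monitoring endpoint that the config above enables (http.host: localhost, http.port: 5066), without switching outputs. This is only a minimal Python sketch: it assumes the /stats response exposes the cumulative counter libbeat.output.events.acked, and the function names are made up for illustration.

```python
import json
import time
import urllib.request

# Assumption: matches http.host / http.port from the Filebeat config above.
STATS_URL = "http://localhost:5066/stats"


def acked_events(stats):
    """Return the cumulative number of events acknowledged by the output."""
    return stats["libbeat"]["output"]["events"]["acked"]


def events_per_second(before, after, elapsed_seconds):
    """Average output rate between two /stats snapshots."""
    return (acked_events(after) - acked_events(before)) / elapsed_seconds


def fetch_stats(url=STATS_URL):
    """Fetch one snapshot from the Beat's HTTP endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def measure(interval_seconds=10.0):
    """Take two snapshots interval_seconds apart and return events/s."""
    before = fetch_stats()
    started = time.monotonic()
    time.sleep(interval_seconds)
    after = fetch_stats()
    return events_per_second(before, after, time.monotonic() - started)
```

Calling measure() on the Filebeat host while the Elasticsearch output is active should give a number comparable to what pv reports for the console output.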

If I check the flows/events in Kibana, I can see a gap building up between the timestamps netflow.exporter.timestamp and event.created:


From what I understand, event.created is the timestamp at which Filebeat has finished processing the event. The difference between the two timestamps grows to about 10 minutes once Filebeat has been running for a few hours. After that the gap stops growing; I guess that's the point where Filebeat starts dropping events.
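To put the 10-minute gap into perspective, a back-of-the-envelope calculation may help. The sketch below is plain arithmetic on the numbers from this post. It only considers the memory queue (queue.mem.events: 200000) and deliberately ignores the input's own queue_size and the socket read_buffer, which is a simplifying assumption; the function names are illustrative.

```python
def backlog_events(lag_seconds, drain_rate):
    """Events that must be waiting somewhere to explain a given lag."""
    return lag_seconds * drain_rate


def seconds_to_fill(capacity_events, ingest_rate, drain_rate):
    """Time until a buffer is full when ingest outpaces drain."""
    surplus = ingest_rate - drain_rate
    if surplus <= 0:
        return float("inf")  # the buffer never fills
    return capacity_events / surplus


# Numbers from the post: ~30,000 flows/s arriving, ~22,000 flows/s indexed.
lag_backlog = backlog_events(lag_seconds=10 * 60, drain_rate=22_000)
fill_time = seconds_to_fill(200_000, ingest_rate=30_000, drain_rate=22_000)

print(f"backlog implied by a 10 min lag: {lag_backlog:,.0f} events")
print(f"time to fill the memory queue:   {fill_time:.0f} s")
```

Under these assumptions a 10-minute lag corresponds to roughly 13.2 million events, while the 200,000-event memory queue would fill within about 25 seconds at an 8,000 events/s surplus. That would suggest the delay is not accumulating in the memory queue alone.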

From my point of view I would expect to see dropped events inside the stack monitoring of that beat instance, but I don't. This is a screenshot from the collected metrics (last 30 minutes):
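One place where drops would not show up in the Beat's own metrics is the kernel: if the UDP socket's receive buffer overflows, Linux discards the datagram before Filebeat ever reads it. On Linux the per-socket counter is the last column (drops) of /proc/net/udp. The following is a minimal Python sketch that sums that column for the netflow port; the function name is hypothetical, and sockets bound via IPv6 would be listed in /proc/net/udp6 instead.

```python
def udp_drops(proc_net_udp, port):
    """Sum the kernel 'drops' counter for all UDP sockets bound to port.

    proc_net_udp is the text content of /proc/net/udp (or /proc/net/udp6).
    """
    total = 0
    for line in proc_net_udp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 13:
            continue  # not a socket row
        # local_address looks like "00000000:0807": hex address, hex port
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            total += int(fields[-1])  # last column is the drop counter
    return total
```

On the Filebeat host this could be used as udp_drops(open("/proc/net/udp").read(), 2055). Reading the value twice a few seconds apart shows whether the counter is still increasing.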

So at this point I'm out of ideas for increasing throughput. I don't see any of the typical performance issues such as a slow CPU, lack of memory or I/O pressure. So maybe some of you have an idea of what I can try. :slightly_smiling_face:

My first inclination is that it's the DNS processor you have. That is going to wait for responses before publishing the events which can slow things down dramatically. Have you tried benchmarking without it?

Yes, I had the same thought and tried it without the DNS processor = no difference.

To be honest, I would have been surprised if that had made a difference, because currently we are exporting from only 7 sources and the TTL is set quite high. Additionally, I ran tcpdump on port 53 and saw just a few packets, so I guess the cache is working as expected.

The next thing I can think of is increasing the number of workers. Maybe this will help: "How to Tune Elastic Beats Performance: A Practical Example with Batch Size, Worker Count, and More" (Elastic Blog).

Unfortunately not, 5 workers seem to be the "sweet spot" currently. I've now installed a second Filebeat VM running on the same hardware and it reaches the same rate. So currently I'm indexing about 45k events/s in total.

Well, having to set up a new VM for every 20k events/s is not really a solution :laughing:

Unfortunately I still haven't found the reason for the 20k/s limit. My solution for the moment is to run multiple Filebeat instances on the same machine via Docker. Currently I have 4 Docker containers running Filebeat with the netflow input. Together they are able to collect at least 80k/s, which should be sufficient for us for now.

As this is only a workaround, I would still appreciate hints on how to achieve higher throughput with just a single instance.