Hi folks,
we are importing flow data into our 10-node Elasticsearch cluster via the Filebeat netflow input. The stack is running on 7.14.0. Unfortunately I am seeing performance problems, and after all the debugging and analysis I've done, I've come to the conclusion that the Filebeat instance handling the incoming netflow traffic must be the bottleneck.
I've read several times that the Elasticsearch output is usually the bottleneck rather than Filebeat itself, but I see a few indicators that point to Filebeat.
Filebeat is running on an 8-core VM with 16 GB of memory, a 10G link and a fast PCIe NVMe SSD; it is currently able to index about 22,000 flows per second. I estimate that the real flow rate hitting the input is around 30,000 flows per second, and we plan to increase that to about 70,000 flows per second. Here's the filebeat config:
filebeat.modules:
- module: netflow
  log:
    enabled: true
    var:
      netflow_host: 0.0.0.0
      netflow_port: 2055
      queue_size: 50000
      read_buffer: 100MiB
      detect_sequence_reset: true
      tags: ["netflow", "forwarded"]

output.elasticsearch:
  hosts: ["elastic01.example.com:9200", "elastic02.example.com:9200", "elastic03.example.com:9200", "elastic04.example.com:9200", "elastic05.example.com:9200"]
  protocol: "https"
  ssl.verification_mode: none
  username: "beats"
  password: "password"
  indices:
    - index: "filebeat-%{[event.module]}"
      when.equals:
        event.module: "netflow"
  #pipeline: dns_geoip-info
  worker: 5
  bulk_max_size: 0
  compression_level: 0

queue:
  mem:
    events: 200000
    #flush.min_events: 4000
    flush.timeout: 2s

processors:
  - dns:
      type: reverse
      fields:
        observer.ip: netflow.exporter.hostname
      success_cache:
        capacity.initial: 1000
        capacity.max: 10000
        min_ttl: 4h

http.enabled: true
http.host: localhost
http.port: 5066
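One thing I'm not sure about is whether the 100MiB read_buffer above is actually honoured: as far as I know the kernel silently caps the UDP socket receive buffer at net.core.rmem_max, so if that sysctl is lower, the buffer stays small no matter what the config says. This is roughly how I'd check and raise it on the collector host (Linux, value just as an example):

# Show the current kernel cap for socket receive buffers (bytes)
sysctl net.core.rmem_max

# Raise it so a 100MiB read_buffer can actually be applied
# (value in bytes; put it in /etc/sysctl.d/ to survive reboots)
sudo sysctl -w net.core.rmem_max=104857600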
If I change the output to console and pipe it through pv, I can see that roughly the same number of events is emitted:
filebeat -c /home/user/test_filebeat.yml | pv -Warl > /dev/null
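Since the HTTP endpoint is enabled in the config above, I can also pull Filebeat's internal pipeline counters and compare them with the pv rate. Something like this (jq is just for readability; the exact metric names may differ slightly between versions):

# Snapshot of the libbeat pipeline and output counters from the monitoring endpoint
curl -s http://localhost:5066/stats | jq '.libbeat.pipeline.events, .libbeat.output.events'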
If I check the flows/events in Kibana, I can see a gap building up between the timestamps netflow.exporter.timestamp and event.created:
From what I understand, "event.created" is the timestamp at which Filebeat has finished processing the event. The difference between those two timestamps grows to about 10 minutes if Filebeat runs for a few hours. Beyond roughly 10 minutes the gap does not increase anymore; I guess that's the point where Filebeat starts dropping events.
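To put a number on that gap, a query like the sketch below could compute the lag directly in Elasticsearch with a runtime field (it assumes both fields are mapped as dates and present on every flow event, and uses the index name and credentials from the config above):

# Average and maximum delay between flow export and Filebeat processing, last 15 minutes
curl -sk -u beats:password "https://elastic01.example.com:9200/filebeat-netflow/_search" \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": { "range": { "event.created": { "gte": "now-15m" } } },
  "runtime_mappings": {
    "ingest_lag_ms": {
      "type": "long",
      "script": "emit(doc[\"event.created\"].value.toInstant().toEpochMilli() - doc[\"netflow.exporter.timestamp\"].value.toInstant().toEpochMilli())"
    }
  },
  "aggs": {
    "avg_lag_ms": { "avg": { "field": "ingest_lag_ms" } },
    "max_lag_ms": { "max": { "field": "ingest_lag_ms" } }
  }
}'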
I would expect to see dropped events in the stack monitoring of that Beat instance, but I don't. This is a screenshot of the collected metrics (last 30 minutes):
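My current suspicion is that, if Filebeat can't keep up, the flows get dropped at the kernel UDP socket before they ever reach the beat, which would explain why nothing shows up as dropped in stack monitoring. I'd check for that on the collector host roughly like this (Linux):

# Global UDP counters: growing "packet receive errors" / "receive buffer errors" means the kernel is dropping datagrams
netstat -su

# Per-socket view: a constantly full Recv-Q on port 2055 points in the same direction
ss -uln 'sport = :2055'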
At this point I'm out of ideas on what else I can try to increase throughput. I don't see any of the typical performance issues such as CPU saturation, lack of memory, I/O pressure, etc. So maybe some of you have an idea of what I could try next.