Hello All,
I was recently messing around with an Elastic Agent Netflow integration setup and noticed that events were being dropped.
The integration definition looked something like:
inputs:
  - id: netflow-netflow-ce476ed5-52e1-4c1c-9f7c-793a0afeebdb
    name: netflow-production
    revision: 48
    type: netflow
    use_output: a5dd7be0-64fe-11ed-ab11-d7fe3acf785c
    meta:
      package:
        name: netflow
        version: 2.12.0
    data_stream:
      namespace: private.default.production
    package_policy_id: ce476ed5-52e1-4c1c-9f7c-793a0afeebdb
    streams:
      - id: netflow-netflow.log-ce476ed5-52e1-4c1c-9f7c-793a0afeebdb
        data_stream:
          dataset: netflow.log
          type: logs
        expiration_timeout: 30m
        queue_size: 8192
        host: '0.0.0.0:2055'
        max_message_size: 10KiB
        protocols:
          - v1
          - v5
          - v6
          - v7
          - v8
          - v9
          - ipfix
        detect_sequence_reset: true
        tags:
          - netflow
          - forwarded
        publisher_pipeline.disable_host: true
        processors:
          - dns:
              type: reverse
              fields:
                server.ip: server.domain
                client.ip: client.domain
                source.ip: source.domain
                destination.ip: destination.domain
              tag_on_failure: [_dns_reverse_lookup_failed]
Basically, a default Netflow integration with a DNS processor.
This setup wasn't processing that many events, roughly ~2k e/s; however, it was never really able to keep up.
After doing some troubleshooting, I found that if I removed the DNS processor, the integration was able to keep up with events.
Reading through the DNS processor docs, I figured the likely culprit was high churn in the DNS processor cache, though the metrics didn't really indicate this:
"processor": {
"dns": {
"6": {
"<snipped>": {
"response": {
"ptr": {
"histogram": {
"count": 1286,
"max": 390012659,
"mean": 76150882.7918288,
"median": 1841882.5,
"min": 440853,
"p75": 135913061.5,
"p95": 293640752.45,
"p99": 321480634.1100001,
"p999": 388944845.0700001,
"stddev": 106853753.70482157
}
}
},
"success": 212
},
"cache": {
"hits": 1064,
"misses": 212
}
}
}
}
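For what it's worth, those numbers work out to a cache hit rate of about 1064 / (1064 + 212) ≈ 83%. Assuming the histogram values are nanoseconds, the actual reverse lookups average ~76 ms with a p95 around ~294 ms, so if each cache miss blocks the event path for that long, even a modest number of misses could plausibly stall the pipeline at ~2k e/s.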
I tried doing some cache "tuning" by setting it to numbers I knew would allow all DNS lookups to be cached:
- dns:
    type: reverse
    fields:
      server.ip: server.domain
      client.ip: client.domain
      source.ip: source.domain
      destination.ip: destination.domain
    tag_on_failure: [_dns_reverse_lookup_failed]
    success_cache.capacity.initial: 25000
    success_cache.capacity.max: 50000
    success_cache.min_ttl: 5m
    failure_cache.capacity.initial: 25000
    failure_cache.capacity.max: 50000
    failure_cache.ttl: 5m
However, even after these changes, I didn't see any improvement. This leads me to believe that something in the DNS processor's caching is performing relatively poorly.
I ended up "fixing" this for now by routing the events through Logstash with a somewhat similar setup in its DNS filter:
# check if client.ip exists
if [client][ip] {
  # check that client.domain doesn't already exist and client.ip isn't a loopback, link-local, or private (ULA) address
  if ![client][domain] and [client][ip] !~ /^(?:127\.|169\.254\.|::1|[fF][cCdD][0-9a-fA-F]{2}:|[fF][eE][89aAbB][0-9a-fA-F]:).*/ {
    # copy client.ip to client.domain
    mutate {
      copy => { "[client][ip]" => "[client][domain]" }
      id => "client_ip_copy_to_client_domain"
    }
    # perform a reverse lookup on client.domain (really client.ip) and replace it with the result
    dns {
      reverse => ["[client][domain]"]
      action => "replace"
      failed_cache_size => 10240
      failed_cache_ttl => 300
      hit_cache_size => 10240
      hit_cache_ttl => 600
      timeout => 0.5
      id => "client_domain_reverse_dns_lookup"
    }
  }
}
^ The above filter is repeated four times, once for each IP field (client.ip, server.ip, source.ip, destination.ip).
With the above, Logstash is easily able to keep up with lookups, even though (to my understanding) Logstash needs to maintain four distinct caches (one per filter instance).
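(One difference that might matter: Logstash runs its filters across multiple pipeline worker threads, so slow lookups on cache misses get spread across workers; I'm not sure how much parallelism the DNS processor gets inside the Elastic Agent netflow input.)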
I would have expected the DNS processor on the Beats/Elastic Agent side to have a relatively easy time keeping up with 2k e/s given the cache settings above.
I'm not too familiar with how Beats implements its DNS caching, but having implemented similar caching mechanisms in the past with things like fastcache, I would expect that, once an entry is stored in the cache, the processor would be able to serve millions of lookups per second.
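For reference, here's a rough Go sketch of the kind of cache I have in mind: a mutex-guarded map keyed by IP. This is just my assumption about the general pattern, not the actual Beats implementation, and the names (ptrCache, BenchmarkCacheHits, the file name) are made up for the example. Running go test -bench=. on something like this shows millions of hits per second, which is why I don't think serving cached entries should be the bottleneck:

```go
// A minimal, self-contained sketch (not the actual Beats code) of a
// mutex-guarded reverse-DNS cache: once an entry is present, a hit is
// just a map read under an RWMutex.
// Save as e.g. cache_bench_test.go and run: go test -bench=.
package main

import (
	"fmt"
	"sync"
	"testing"
)

// ptrCache maps an IP string to a previously resolved domain.
type ptrCache struct {
	mu   sync.RWMutex
	data map[string]string
}

func (c *ptrCache) get(ip string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	domain, ok := c.data[ip]
	return domain, ok
}

func (c *ptrCache) put(ip, domain string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[ip] = domain
}

// BenchmarkCacheHits measures pure cache-hit throughput with the cache
// pre-populated, i.e. the steady state once every observed IP has been
// looked up and cached.
func BenchmarkCacheHits(b *testing.B) {
	c := &ptrCache{data: make(map[string]string)}
	keys := make([]string, 1024)
	for i := range keys {
		keys[i] = fmt.Sprintf("10.0.%d.%d", i/256, i%256)
		c.put(keys[i], "host.example.com")
	}
	b.RunParallel(func(pb *testing.PB) {
		i, hits := 0, 0
		for pb.Next() {
			if _, ok := c.get(keys[i]); ok {
				hits++
			}
			i = (i + 1) % len(keys)
		}
		_ = hits
	})
}
```

If the real cost is somewhere else (e.g. each miss blocking the event path for the full lookup, or contention around cache expiry), then the capacity/TTL tuning above wouldn't be expected to help much, which would match what I'm seeing.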
Would anyone have any ideas here?