Elastic Agent/Beats DNS Processor Caching: Poor Performance?

Hello All,

I was recently messing around with an Elastic Agent Netflow integration setup and noticed that events were being dropped.

The integration definition looked something like:

inputs:
  - id: netflow-netflow-ce476ed5-52e1-4c1c-9f7c-793a0afeebdb
    name: netflow-production
    revision: 48
    type: netflow
    use_output: a5dd7be0-64fe-11ed-ab11-d7fe3acf785c
    meta:
      package:
        name: netflow
        version: 2.12.0
    data_stream:
      namespace: private.default.production
    package_policy_id: ce476ed5-52e1-4c1c-9f7c-793a0afeebdb
    streams:
      - id: netflow-netflow.log-ce476ed5-52e1-4c1c-9f7c-793a0afeebdb
        data_stream:
          dataset: netflow.log
          type: logs
        expiration_timeout: 30m
        queue_size: 8192
        host: '0.0.0.0:2055'
        max_message_size: 10KiB
        protocols:
          - v1
          - v5
          - v6
          - v7
          - v8
          - v9
          - ipfix
        detect_sequence_reset: true
        tags:
          - netflow
          - forwarded
        publisher_pipeline.disable_host: true
        processors:
          - dns:
              type: reverse
              fields:
                server.ip: server.domain
                client.ip: client.domain
                source.ip: source.domain
                destination.ip: destination.domain
              tag_on_failure: [_dns_reverse_lookup_failed]

Basically, a default Netflow integration with a DNS processor.

This setup wasn't processing that many events, roughly ~2k events/s, but it was never really able to keep up.

After doing some troubleshooting, I found that if I removed the DNS processor, the integration was able to keep up with events.

Reading through the DNS processor docs, I figured the likely culprit was high churn in the DNS processor cache, though the metrics didn't really indicate that (the numbers below work out to roughly an 83% cache hit rate, 1064 hits vs. 212 misses):

"processor": {
    "dns": {
        "6": {
            "<snipped>": {
                "response": {
                    "ptr": {
                        "histogram": {
                            "count": 1286,
                            "max": 390012659,
                            "mean": 76150882.7918288,
                            "median": 1841882.5,
                            "min": 440853,
                            "p75": 135913061.5,
                            "p95": 293640752.45,
                            "p99": 321480634.1100001,
                            "p999": 388944845.0700001,
                            "stddev": 106853753.70482157
                        }
                    }
                },
                "success": 212
            },
            "cache": {
                "hits": 1064,
                "misses": 212
            }
        }
    }
}
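
For what it's worth, here is the quick arithmetic I did on those numbers. Note that I'm assuming the histogram values are nanoseconds; that's just my reading of them, not something I've confirmed in the docs:

package main

import "fmt"

func main() {
	// Quick sanity check on the metrics above.
	// ASSUMPTION: histogram values are nanoseconds.
	hits, misses := 1064.0, 212.0
	meanNs := 76150882.7918288

	fmt.Printf("cache hit rate: %.1f%%\n", hits/(hits+misses)*100) // ~83.4%
	fmt.Printf("mean PTR response: %.0f ms\n", meanNs/1e6)         // ~76 ms
}

So the cache is being hit most of the time, and the slow part appears to be the actual (uncached) PTR lookups, which is why churn didn't look like the explanation to me.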

I tried doing some cache "tuning" by setting the sizes and TTLs to values I knew would allow all of the DNS lookups to be cached:

- dns:
    type: reverse
    fields:
      server.ip: server.domain
      client.ip: client.domain
      source.ip: source.domain
      destination.ip: destination.domain
    tag_on_failure: [_dns_reverse_lookup_failed]
    success_cache.capacity.initial: 25000
    success_cache.capacity.max: 50000
    success_cache.min_ttl: 5m
    failure_cache.capacity.initial: 25000
    failure_cache.capacity.max: 50000
    failure_cache.ttl: 5m

However, even after these changes I didn't see any improvement, which leads me to believe that something in the DNS processor's caching path performs relatively badly.
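
To be clear about what I mean by "performing badly": the sketch below is purely artificial and is not the Beats code, just a hypothetical illustration of the kind of pattern that would cap throughput even with a near-perfect hit rate, i.e. cached lookups that still pay a fixed per-lookup cost under an exclusive lock:

package main

// NOT the Beats implementation -- a made-up example showing how a cache
// with a high hit rate can still be a bottleneck if every lookup holds an
// exclusive lock while doing some fixed amount of bookkeeping.

import (
	"fmt"
	"sync"
	"time"
)

type slowCache struct {
	mu sync.Mutex // exclusive lock, even for reads
	m  map[string]string
}

func (c *slowCache) get(ip string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(200 * time.Microsecond) // stand-in for TTL/eviction bookkeeping
	v, ok := c.m[ip]
	return v, ok
}

func main() {
	c := &slowCache{m: map[string]string{"10.0.0.1": "host.example.internal"}}

	const lookups, workers = 20000, 8
	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < lookups/workers; i++ {
				c.get("10.0.0.1") // always a cache hit
			}
		}()
	}
	wg.Wait()
	elapsed := time.Since(start)
	fmt.Printf("%d cached lookups in %s (~%.0f/s)\n",
		lookups, elapsed, float64(lookups)/elapsed.Seconds())
}

If something along those lines is happening, then bigger caches and longer TTLs wouldn't help, which would match what I'm seeing.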

I ended up "fixing" this for now by routing the events through Logstash with a somewhat similar setup in its DNS filter:

# check if client.ip exists
if [client][ip] {
    # check that client.domain doesn't exist and client.ip isn't a loopback address
    if ![client][domain] and [client][ip] !~ /^(?:127\.|169\.254\.|::1|[fF][cCdD][0-9a-fA-F]{2}:|[fF][eE][89aAbB][0-9a-fA-F]:).*/ {
        # copy client.ip to client.domain
        mutate {
            copy => { "[client][ip]" => "[client][domain]" }
            id => "client_ip_copy_to_client_domain"
        }
        # perform a reverse lookup on client.domain (really client.ip) and replace with result
        dns {
            reverse => ["[client][domain]"]
            action => "replace"
            failed_cache_size => 10240
            failed_cache_ttl => 300
            hit_cache_size => 10240
            hit_cache_ttl => 600
            timeout => 0.5
            id => "client_domain_reverse_dns_lookup"
        }
    }
}

^ The above filter is repeated once for each of the four IP fields (client, server, source, destination).

With the above, Logstash is easily able to keep up with lookups even though (to my understanding) Logstash needs to maintain 4 distinct caches (one for each filter).

I would have expected the DNS processor on the Beats/Elastic Agent side to have a relatively easy time keeping up with 2k e/s given the level of caching configured.

I'm not too familiar with how Beats implements its DNS caching, but having implemented similar caching mechanisms in the past with things like fastcache, I would expect that, once an entry is stored in the cache, the processor could serve on the order of millions of lookups per second.
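
For context, this is the sort of back-of-the-envelope test I have in mind. It is not the Beats implementation and the key/value contents are made up; it just shows the raw lookup rate a warmed in-memory cache like fastcache can sustain on its own:

package main

// Rough sketch only: measures how fast a pre-warmed fastcache instance can
// serve fake reverse-lookup results, to show why cached lookups at 2k e/s
// (at most ~8k lookups/s across four IP fields) should be trivial.

import (
	"fmt"
	"time"

	"github.com/VictoriaMetrics/fastcache"
)

func main() {
	c := fastcache.New(32 * 1024 * 1024) // 32 MiB cache

	// Pre-populate with 50k fake PTR results, one per IP.
	keys := make([][]byte, 50000)
	for i := range keys {
		keys[i] = []byte(fmt.Sprintf("10.0.%d.%d", i/256, i%256))
		c.Set(keys[i], []byte("host.example.internal"))
	}

	// Hammer the warmed cache and report the lookup rate.
	const lookups = 5_000_000
	var buf []byte
	start := time.Now()
	for i := 0; i < lookups; i++ {
		buf = c.Get(buf[:0], keys[i%len(keys)])
	}
	elapsed := time.Since(start)
	fmt.Printf("%d cached lookups in %s (~%.0f lookups/s), last=%q\n",
		lookups, elapsed, float64(lookups)/elapsed.Seconds(), buf)
}

A loop like that should comfortably do millions of lookups per second on modern hardware, which is why the cached path struggling at ~2k e/s with an ~83% hit rate surprises me.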

Would anyone have any ideas here?
