DNS Filter

I set up a filter to do some internal DNS resolution, but it seems that when I give it a proper DNS server, it stops processing events. If I give it a bogus DNS server I, obviously, get non-stop resolution failures in the log, but it moves along just fine. Any ideas?

dns {
  action => "replace"                      # overwrite the source field with the resolved name
  max_retries => 0                         # don't retry failed lookups
  nameserver => ["10.244.10.231"]          # internal DNS server
  reverse => ["[event_data][IpAddress]"]   # field holding the IP to reverse-resolve
  timeout => 0.5                           # per-lookup timeout, in seconds
  failed_cache_size => 100                 # failed lookups to remember, in entries
  failed_cache_ttl => 10                   # seconds before a cached failure expires
  hit_cache_size => 1024                   # successful lookups to remember, in entries
}

Looks like 6x 2.6GHz processors just aren't able to keep up. The event emission rate appears to be capped at about 400 events/second, which is odd because I don't see any resource bottleneck on the server: CPU under 10%, disk writes under 10MB/s, JVM around 50% utilization. Below is my event emission monitor; you can see on the right side where I enabled the DNS filter on two separate occasions, and the resulting spike in emission after it's turned off and the queue is processed out. I wonder: if the DNS resolution fails, does it just stick the event back in the queue to try again later? That would be strange behavior if so. Not really sure what the issue is. I even tried setting the timeout to 0.05 (that's 50ms, since the timeout is in seconds), which didn't change anything for the better.

How many pipeline workers? You might want to add more, since they may be tied up waiting for responses from the DNS server.

This filter, like all filters, only processes 1 event at a time, so the use of this plugin can significantly slow down your pipeline’s throughput if you have a high latency network. By way of example, if each DNS lookup takes 2 milliseconds, the maximum throughput you can achieve with a single filter worker is 500 events per second (1000 milliseconds / 2 milliseconds).

-- DNS Filter docs
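For what it's worth, the worker count is a settings change rather than a pipeline change. Here's a minimal sketch, assuming you configure it via logstash.yml (the -w command-line flag works too); the values are illustrative, not recommendations:

pipeline.workers: 12      # defaults to the number of CPU cores
pipeline.batch.size: 125  # events per batch per worker (125 is the default)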

You may also want to look at increasing the sizes and TTLs of the filter's hit and miss caches, so you can avoid round trips to your DNS server for recently queried addresses.
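Something along these lines inside the dns filter block; the values here are illustrative only, not tuned recommendations:

dns {
  # ... other options as in your existing config ...
  hit_cache_size => 8192      # successful lookups to keep, in entries
  hit_cache_ttl => 300        # seconds before a cached hit expires
  failed_cache_size => 8192   # failed lookups to keep, in entries
  failed_cache_ttl => 60      # seconds before a cached miss expires
}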

The default worker count matches the number of CPUs, so 6 in my instance. The cache sizes, are they in entries, bytes, kilobytes, etc.? I did increase the failed cache to 1000 and the TTL to 120 without any noticeable improvement. CPU usage goes through the roof when the DNS filter is turned off, so I'm a little gun-shy about increasing the number of workers for fear of negative performance changes when it's handling other processes.

The cache sizes are a number of entries (hits or misses), and the TTLs are in seconds.

A pipeline cannot move faster than the slowest filter in the chain; when a particular filter cannot process fast enough, the filters feeding into it get blocked and wait until a worker can pick up the work. When a pipeline configuration is known to be bound on IOWAIT, increasing the workers so more events can be in flight simultaneously is the normal thing to do.

Do you know how I can identify an IOWAIT bottleneck on a Windows installation?

Unfortunately I am personally useless with Windows :weary:

A brief look tells me that IOWAIT is calculated differently on Windows because the kernel implements scheduling in a different way; maybe this will point you in the right direction?

Well, I did some experimenting, but nothing seems to get the job done. I went up to 12 and 18 workers, and increased the batch size to 250 with 12 workers. 18 workers appeared to cap me out at about 100 e/s with CPU usage hovering around 50%. Increasing the batch size didn't appear to make any difference other than increased JVM allocation.

Oh well, it's not a make-or-break deal for me; it would have been nice for the junior workers who are going to be using this. It could also just be the massive number of resolution failures (mostly repetitive) due to missing reverse zones. I had intended to resolve that anyway, so I may revisit this in the near future. Thanks for your time, yaauie.

If you have the cache enabled for any DNS filters and you are seeing 100 events/sec, that is in line with my experience as well. Increasing cores or workers won't help.

The way that the DNS filter uses the cache is synchronous, which has the effect that only one lookup at a time can be processed. So basically, enabling the cache makes your Logstash pipeline the equivalent of a single worker.

So one way to improve throughput is actually to disable the cache by removing the related configuration attributes. This means that every lookup is sent to the configured name server, and depending on the latency to your server and the load it can handle, it may make things better or worse. With a DNS server on the same subnet I usually see an increase to about 400 events/sec.

A better way to improve throughput is to run a local dnsmasq process on the Logstash box, which listens for requests on localhost, and configure the DNS filter to use localhost for resolution. Anything dnsmasq can't answer will be forwarded to an upstream server, while previous responses will be cached by dnsmasq, providing a quick response. Using dnsmasq will probably get you to around 1000 events/sec, depending on a few factors that would require a much longer response to explain.
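A minimal sketch of the idea, with assumed illustrative values (your upstream server and cache size will differ):

# /etc/dnsmasq.conf
listen-address=127.0.0.1   # answer queries from the local box only
no-resolv                  # don't read upstream servers from /etc/resolv.conf
server=10.244.10.231       # forward cache misses to the real DNS server
cache-size=10000           # number of names to cache

Then point the filter at the local forwarder, with the filter's own cache options removed:

dns {
  action => "replace"
  max_retries => 0
  nameserver => ["127.0.0.1"]              # local dnsmasq
  reverse => ["[event_data][IpAddress]"]
  timeout => 0.5
}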

NOTE: I know you mentioned that you are on Windows. I have no idea if there is an equivalent of dnsmasq on Windows. You basically need some kind of caching DNS forwarder that you can leverage the same way as dnsmasq on Linux.

There was a recent commit to the DNS filter that allows it to cache other types of lookup failures as misses, which is especially useful for security use cases. However, no one will really be able to take advantage of this until the filter is further modified to better handle multiple parallel requests.


Wow, lots of good information here; thanks for taking the time. The problem with this particular environment is that there are approximately 5k users and 4k devices. I am looking at doing DNS lookups on a field for each Security log Event ID 4625, 4740, and 4771 logged across half a dozen domain controllers. Without going into any discussion of the frequency of these events, 1k/sec still isn't quite enough. I look forward to the DNS filter maturing to the point where I can use it in this environment.
