DNS Filter Lookup Failures since 6.7 upgrade

Hi Guys,

Hoping someone can offer some guidance. Since upgrading our LS to 6.7.0, we've consistently seen issues with LS resolving DNS entries, and they appear to build over time. E.g. -

[logstash.filters.dns ] DNS: timeout on resolving address. {:field=>"[destination-address]", :value=>"x.x.x.x"}

It's normal for us to see 'some' addresses which are unresolvable, but what appears to be happening is that over time (say 1-2 hours) LS logs more and more of these, until eventually it stops processing ingested messages and the ES/Kibana dashboards miss data.
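
For reference, the dns filter block producing those messages looks roughly like this (simplified and from memory; only the field name is taken from the log line above):

filter {
  dns {
    # reverse-resolve the destination IP into a hostname
    reverse => [ "[destination-address]" ]
    action  => "replace"
  }
}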

If we restart the LS service, it's fine again, but only for a few hours. It doesn't matter whether it's the middle of a peak period or the middle of the night.

As I said, it was fine before the upgrade, and locally on the LS server we can run nslookup/dig to resolve addresses just fine.

We're ingesting logs from various sources - Bro IDS and Blue Coat via rsyslog or Filebeat, along with Juniper, Meru and Pulse Secure logs via syslog.

Any ideas? Anyone else having a similar issue?

There's no error logged for this by LS. It 'just' stops processing until a restart. With Metricbeat and Filebeat that's fine, since Beats will back off and retry, but with syslog we end up missing data.

Cheers
Andy

Hi All,

For evidence -

[2019-03-29T21:51:34,061][WARN ][logstash.filters.dns     ] DNS: timeout on resolving address. {:field=>"[destination-address]", :value=>"10.250.37.62"}
root@logstash00:/var/log/logstash# nslookup 10.250.37.62
Server:		10.250.91.67
Address:	10.250.91.67#53

62.37.250.10.in-addr.arpa	name = sar-xxxx.xxxxx.

Andy

We ran into the same problem after updating to ELK 6.7.
About one hour after restarting Logstash we get the same error messages for (it looks like) every log line processed by Logstash, so we're seeing quite huge delays. The only thing that works so far is to disable DNS PTR lookups completely in our filters (see the sketch below).
After restarting Logstash everything runs smoothly for a while. We first thought that the issue might be this https://github.com/logstash-plugins/logstash-filter-dns/issues/40 but it seems our resolv.rb is already correctly patched.
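
'Disabling' for us just means commenting the dns block out of the pipeline for now, e.g. (the field name here is only an example, not our actual config):

filter {
  # DNS PTR lookups disabled until the regression is fixed
  # dns {
  #   reverse => [ "[source_ip]" ]
  #   action  => "replace"
  # }
}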

Got the same issue after 6.7.0 update.
6.7.1 update didn't fix the issue.

[2019-04-17T10:18:20,265][WARN ][logstash.filters.dns ] DNS: timeout on resolving address. {:field=>"host", :value=>"172.23.6.9"}

root@syslog-core01:/var/log# host 172.23.6.9
9.6.23.172.in-addr.arpa domain name pointer host2.core.example.com.

logstash-filter-dns plugin version: 3.0.12

Even the 7.0.0 update didn't fix it.

Any suggestions/ideas?

Heads up to say that I believe I just found the cause of this regression; it relates to a library update in the resolver code. I will be following up shortly, but for now the only workaround is to downgrade to the latest 6.6 series (6.6.2 as of now). Just downgrading the dns filter plugin version will not help.


Same problem here with LS version 6.7.1. We had planned an Elastic upgrade to 7.0 in production today, but after this problem we won't do that.

Yes, this affects all versions starting at and after v6.7, including v7, and the only workaround for now is to downgrade to v6.6.

Progress can be tracked in https://github.com/logstash-plugins/logstash-filter-dns/issues/51

I may have a temporary workaround I'd like to validate; any feedback appreciated.

If you are using the nameserver option of the dns filter with multiple hosts configured, OR you are not using the nameserver option but have multiple servers configured in /etc/resolv.conf,

then as a tentative temporary workaround you can try using a SINGLE server in either the nameserver option or in your /etc/resolv.conf and see if that solves it.
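
For example, a filter along these lines (a sketch only; the field name and resolver address are just taken from the logs earlier in this thread):

filter {
  dns {
    reverse    => [ "[destination-address]" ]
    action     => "replace"
    # temporary workaround: point at a single resolver instead of several
    nameserver => [ "10.250.91.67" ]
  }
}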

Hi @colinsurprenant,

Thanks for the updates, and confirmation there's a problem to fix.

I've applied your logic to our test cluster, and am waiting for some log entries. We have multiple entries in resolv.conf, so I've switched the LS conf files to use only one nameserver.

I have had some, but nowhere near as many so far as before. It probably needs a couple of hours though.

One oddity is this, however -

[2019-04-25T20:19:43,370][WARN ][logstash.filters.dns     ] DNS: timeout on resolving address. {:field=>"[destination-address]", :value=>"209.112.114.33"}

 root@elk00:~# nslookup
> 209.112.114.33
;; Truncated, retrying in TCP mode.
33.114.112.209.in-addr.arpa	name = k4.nstld.com.
33.114.112.209.in-addr.arpa	name = l4.nstld.com.
33.114.112.209.in-addr.arpa	name = a22.verisigndns.com.
33.114.112.209.in-addr.arpa	name = f4.nstld.com.
33.114.112.209.in-addr.arpa	name = ns2.euro909.com.
33.114.112.209.in-addr.arpa	name = a23.verisigndns.com.
33.114.112.209.in-addr.arpa	name = ns0.netnames.net.
33.114.112.209.in-addr.arpa	name = ns1.netnames.net.
33.114.112.209.in-addr.arpa	name = ns1.ascio.net.
33.114.112.209.in-addr.arpa	name = ns2.domainnetwork.se.
33.114.112.209.in-addr.arpa	name = ns3.ascio.net.
33.114.112.209.in-addr.arpa	name = ns2.dnsvisa.com.
33.114.112.209.in-addr.arpa	name = g4.nstld.com.
33.114.112.209.in-addr.arpa	name = a21.verisigndns.com.
33.114.112.209.in-addr.arpa	name = ns2.webipdns.com.au.
33.114.112.209.in-addr.arpa	name = ns3.netnames.net.
33.114.112.209.in-addr.arpa	name = ns5.netnames.net.
33.114.112.209.in-addr.arpa	name = a2.verisigndns.com.
33.114.112.209.in-addr.arpa	name = pdns1.cscdns.net.
33.114.112.209.in-addr.arpa	name = indom30.indomco.fr.
33.114.112.209.in-addr.arpa	name = ns7.netnames.net.
33.114.112.209.in-addr.arpa	name = indom10.indomco.com.
33.114.112.209.in-addr.arpa	name = dns1.cscdns.net.
33.114.112.209.in-addr.arpa	name = indom130.indomco.org.

Not sure if the switch to TCP might have caused the filter a problem?

Andy

Thanks @millap for the followup. The timeout on 209.112.114.33 seems unrelated. When this bug is triggered, most if not all requests will time out; the occasional timeout is normal.

Incoming fix in https://github.com/logstash-plugins/logstash-filter-dns/pull/52

The dns filter v3.0.13 has just been published to address this issue. You can update the plugin using

bin/logstash-plugin update logstash-filter-dns
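
If you want to double-check which version you end up with, listing the plugin should show it (run from the Logstash install directory):

bin/logstash-plugin list --verbose logstash-filter-dns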

Please let us know how this new version is working for you.
Colin


After updating to v3.0.13 on the 30th of April, the problem has not occurred anymore. I am going to monitor this further.
