Hoping someone can offer some guidance. Since we upgraded LS to 6.7.0, we've consistently seen issues with LS resolving DNS entries, which appear to build up over time, e.g.:
[logstash.filters.dns ] DNS: timeout on resolving address. {:field=>"[destination-address]", :value=>"x.x.x.x"}
It's normal for us to see 'some' addresses which are unresolvable, but what appears to happen is that over time (say 1-2 hours) LS logs more and more of these problems until eventually it stops processing ingested messages, causing ES/Kibana dashboards to miss data.
If we restart the LS service, it's fine again, but only for a few hours. It doesn't matter whether it's the middle of peak period, or in the middle of the night.
As I said, it was fine before the upgrade, and locally on the LS server we can run nslookup/dig to resolve addresses just fine.
We're ingesting logs from various sources - Bro IDS and Blue Coat via rsyslog or filebeat, along with Juniper, Meru, Pulse Secure logs via Syslog
Any ideas? Anyone else having a similar issue?
There's no error logged for this by LS. It 'just' stops processing until restart. With Metricbeat and Filebeat that's fine, but with syslog we end up missing data.
We ran into the same problem after updating to ELK 6.7.
About one hour after restarting Logstash we get the same error messages for (it looks like) every log line processed by Logstash, so we're seeing quite huge delays. The only thing that works so far is to disable DNS PTR lookups completely in our filters.
After restarting Logstash everything runs smoothly for a while. We first thought the issue might be this https://github.com/logstash-plugins/logstash-filter-dns/issues/40 but it seems our resolv.rb is already correctly patched.
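For context, "disabling DNS PTR lookups" just means removing or commenting out the dns filter block that does the reverse lookups. A minimal sketch of what such a block looks like (the field name and timeout value here are placeholders, not our actual config):

```conf
filter {
  dns {
    reverse => ["[destination-address]"]   # PTR lookup on the IP in this field
    action  => "replace"                   # overwrite the field with the resolved name
    timeout => 2                           # seconds before "timeout on resolving address" is logged
  }
}
```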
Heads up to say that I believe I just found the cause of this regression and it relates to a library update to the resolver code. I will be following up shortly but for now the only workaround is to downgrade to the latest 6.6 series (6.6.2 as of now). Just downgrading the dns filter plugin version will not help.
I may have a temporary workaround I'd like to validate, any feedback appreciated.
If you are using the nameserver option of the dns filter with multiple hosts configured, OR you are not using the nameserver option but have multiple servers configured in /etc/resolv.conf,
then as a tentative temporary workaround you can try using a SINGLE server in either the nameserver option or in your /etc/resolv.conf and see if that solves it.
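To make the workaround concrete, here is a hedged sketch of a dns filter pinned to a single resolver (the IP address and field name are placeholders):

```conf
filter {
  dns {
    reverse    => ["[destination-address]"]
    action     => "replace"
    nameserver => ["10.0.0.53"]   # a SINGLE server, instead of multiple entries
  }
}
```

Alternatively, leave the nameserver option out entirely and trim /etc/resolv.conf down to a single `nameserver` line.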
Thanks for the updates, and confirmation there's a problem to fix.
I've applied your logic to our test cluster and am waiting for some log entries. We have multiple entries in resolv.conf, so I've switched the LS conf files to use only one nameserver.
I have had some timeouts, but nowhere near as many as before. It probably needs a couple of hours, though.
One oddity is this, however:
[2019-04-25T20:19:43,370][WARN ][logstash.filters.dns ] DNS: timeout on resolving address. {:field=>"[destination-address]", :value=>"209.112.114.33"}
root@elk00:~# nslookup
> 209.112.114.33
;; Truncated, retrying in TCP mode.
33.114.112.209.in-addr.arpa name = k4.nstld.com.
33.114.112.209.in-addr.arpa name = l4.nstld.com.
33.114.112.209.in-addr.arpa name = a22.verisigndns.com.
33.114.112.209.in-addr.arpa name = f4.nstld.com.
33.114.112.209.in-addr.arpa name = ns2.euro909.com.
33.114.112.209.in-addr.arpa name = a23.verisigndns.com.
33.114.112.209.in-addr.arpa name = ns0.netnames.net.
33.114.112.209.in-addr.arpa name = ns1.netnames.net.
33.114.112.209.in-addr.arpa name = ns1.ascio.net.
33.114.112.209.in-addr.arpa name = ns2.domainnetwork.se.
33.114.112.209.in-addr.arpa name = ns3.ascio.net.
33.114.112.209.in-addr.arpa name = ns2.dnsvisa.com.
33.114.112.209.in-addr.arpa name = g4.nstld.com.
33.114.112.209.in-addr.arpa name = a21.verisigndns.com.
33.114.112.209.in-addr.arpa name = ns2.webipdns.com.au.
33.114.112.209.in-addr.arpa name = ns3.netnames.net.
33.114.112.209.in-addr.arpa name = ns5.netnames.net.
33.114.112.209.in-addr.arpa name = a2.verisigndns.com.
33.114.112.209.in-addr.arpa name = pdns1.cscdns.net.
33.114.112.209.in-addr.arpa name = indom30.indomco.fr.
33.114.112.209.in-addr.arpa name = ns7.netnames.net.
33.114.112.209.in-addr.arpa name = indom10.indomco.com.
33.114.112.209.in-addr.arpa name = dns1.cscdns.net.
33.114.112.209.in-addr.arpa name = indom130.indomco.org.
Not sure if the switch to TCP might have caused the filter a problem?
Thanks @millap for the follow-up. The timeout on 209.112.114.33 seems unrelated; when this bug is triggered, most if not all requests will time out, and an occasional timeout is normal.