DNS bursts when elasticsearch becomes unreachable


(Cyril Auburtin) #1

Hi, I'm seeing very frequent DNS requests when filebeat can't reach elasticsearch.

This happens in a docker swarm (running on google-cloud compute). When the manager node running elasticsearch is restarted, the other nodes (which all run filebeat) start sending these DNS requests (captured with sudo tcpdump -i eth0 udp port 53, run on one of the other nodes, here swarm-dev-2):

09:23:32.966453 IP swarm-dev-2.c.ula.internal.52664 > metadata.google.internal.domain: 24892+ AAAA? elasticsearch.c.ula.internal. (60)
09:23:32.966748 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.52664: 24892 NXDomain 0/1/0 (149)
09:23:32.966754 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.60436: 47756 NXDomain 0/1/0 (149)
09:23:32.967533 IP swarm-dev-2.c.ula.internal.58947 > metadata.google.internal.domain: 52930+ A? elasticsearch.google.internal. (47)
09:23:32.967723 IP swarm-dev-2.c.ula.internal.47546 > metadata.google.internal.domain: 55002+ AAAA? elasticsearch.google.internal. (47)
09:23:32.968074 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.47546: 55002 NXDomain 0/1/0 (136)
09:23:32.968369 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.58947: 52930 NXDomain 0/1/0 (136)
09:23:32.969285 IP swarm-dev-2.c.ula.internal.33177 > metadata.google.internal.domain: 35516+ A? elasticsearch. (31)
09:23:32.969363 IP swarm-dev-2.c.ula.internal.36729 > metadata.google.internal.domain: 17425+ AAAA? elasticsearch. (31)
09:23:32.969686 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.36729: 17425 NXDomain 0/1/0 (106)
09:23:32.970014 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.33177: 35516 NXDomain 0/1/0 (106)
09:23:32.970768 IP swarm-dev-2.c.ula.internal.44048 > metadata.google.internal.domain: 59875+ A? elasticsearch.c.ula.internal. (60)
09:23:32.970850 IP swarm-dev-2.c.ula.internal.46860 > metadata.google.internal.domain: 65460+ AAAA? elasticsearch.c.ula.internal. (60)
09:23:32.971139 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.46860: 65460 NXDomain 0/1/0 (149)
09:23:32.971274 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.44048: 59875 NXDomain 0/1/0 (149)
09:23:32.971924 IP swarm-dev-2.c.ula.internal.42604 > metadata.google.internal.domain: 59435+ A? elasticsearch.google.internal. (47)
09:23:32.972008 IP swarm-dev-2.c.ula.internal.54032 > metadata.google.internal.domain: 20011+ AAAA? elasticsearch.google.internal. (47)
09:23:32.972231 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.42604: 59435 NXDomain 0/1/0 (136)
09:23:32.972300 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.54032: 20011 NXDomain 0/1/0 (136)
09:23:32.973230 IP swarm-dev-2.c.ula.internal.34285 > metadata.google.internal.domain: 22846+ A? elasticsearch. (31)
09:23:32.973257 IP swarm-dev-2.c.ula.internal.53586 > metadata.google.internal.domain: 57498+ AAAA? elasticsearch. (31)
09:23:32.973606 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.34285: 22846 NXDomain 0/1/0 (106)
09:23:32.973624 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.53586: 57498 NXDomain 0/1/0 (106)
09:23:32.974365 IP swarm-dev-2.c.ula.internal.37475 > metadata.google.internal.domain: 33496+ A? elasticsearch.c.ula.internal. (60)
09:23:32.974408 IP swarm-dev-2.c.ula.internal.38549 > metadata.google.internal.domain: 57663+ AAAA? elasticsearch.c.ula.internal. (60)
09:23:32.974700 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.38549: 57663 NXDomain 0/1/0 (149)
09:23:32.974703 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.37475: 33496 NXDomain 0/1/0 (149)
09:23:32.975456 IP swarm-dev-2.c.ula.internal.34204 > metadata.google.internal.domain: 15829+ AAAA? elasticsearch.google.internal. (47)
09:23:32.975457 IP swarm-dev-2.c.ula.internal.50591 > metadata.google.internal.domain: 5632+ A? elasticsearch.google.internal. (47)
09:23:32.975723 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.34204: 15829 NXDomain 0/1/0 (136)
09:23:32.975742 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.50591: 5632 NXDomain 0/1/0 (136)
09:23:32.976720 IP swarm-dev-2.c.ula.internal.44968 > metadata.google.internal.domain: 25127+ A? elasticsearch. (31)
09:23:32.976748 IP swarm-dev-2.c.ula.internal.40785 > metadata.google.internal.domain: 47024+ AAAA? elasticsearch. (31)
09:23:32.977084 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.40785: 47024 NXDomain 0/1/0 (106)
09:23:32.977131 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.44968: 25127 NXDomain 0/1/0 (106)
09:23:32.977705 IP swarm-dev-2.c.ula.internal.60127 > metadata.google.internal.domain: 55214+ A? elasticsearch.c.ula.internal. (60)
09:23:32.977750 IP swarm-dev-2.c.ula.internal.60080 > metadata.google.internal.domain: 44805+ AAAA? elasticsearch.c.ula.internal. (60)
09:23:32.978065 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.60080: 44805 NXDomain 0/1/0 (149)
09:23:32.978091 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.60127: 55214 NXDomain 0/1/0 (149)
09:23:32.978765 IP swarm-dev-2.c.ula.internal.46028 > metadata.google.internal.domain: 46264+ A? elasticsearch.google.internal. (47)
09:23:32.978933 IP swarm-dev-2.c.ula.internal.46467 > metadata.google.internal.domain: 40148+ AAAA? elasticsearch.google.internal. (47)
09:23:32.979079 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.46028: 46264 NXDomain 0/1/0 (136)
09:23:32.979320 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.46467: 40148 NXDomain 0/1/0 (136)
09:23:32.980257 IP swarm-dev-2.c.ula.internal.56737 > metadata.google.internal.domain: 31679+ AAAA? elasticsearch. (31)
09:23:32.980290 IP swarm-dev-2.c.ula.internal.43360 > metadata.google.internal.domain: 19303+ A? elasticsearch. (31)
09:23:32.980672 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.56737: 31679 NXDomain 0/1/0 (106)
09:23:32.980673 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.43360: 19303 NXDomain 0/1/0 (106)
09:23:32.981376 IP swarm-dev-2.c.ula.internal.40097 > metadata.google.internal.domain: 17465+ A? elasticsearch.c.ula.internal. (60)
09:23:32.981572 IP swarm-dev-2.c.ula.internal.35038 > metadata.google.internal.domain: 597+ AAAA? elasticsearch.c.ula.internal. (60)
09:23:32.981688 IP metadata.google.internal.domain > swarm-dev-2.c.ula.internal.40097: 17465 NXDomain 0/1/0 (149)

What could be done to avoid this high frequency? (Note how close the timestamps are.) There are about 1400 DNS requests per second in that scenario.
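The capture matches ordinary resolver search-list behavior: each failed resolution of the bare name elasticsearch is expanded to three candidate names (the two search domains plus the name as-is), and each candidate is queried for both A and AAAA records. A minimal sketch of that expansion, assuming the node's resolv.conf contains a search list of c.ula.internal and google.internal (read off the capture, not from the actual file):

```python
# Sketch of glibc-style search-list expansion (assumption: the node's
# resolv.conf has "search c.ula.internal google.internal").
search_domains = ["c.ula.internal", "google.internal"]
name = "elasticsearch"

# A bare (dot-less) name is tried against each search domain, then as-is.
candidates = [f"{name}.{d}" for d in search_domains] + [name]

# Each candidate is looked up for both IPv4 (A) and IPv6 (AAAA) records.
queries = [(c, rtype) for c in candidates for rtype in ("A", "AAAA")]
print(len(queries))  # 6 DNS queries per failed resolution attempt
```

At roughly 1400 queries per second, six queries per attempt means well over 200 resolution attempts per second from a single node.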


(Cyril Auburtin) #2

Here's the filebeat config (posted in a separate message because of the message size limit):

filebeat.inputs:
  - type: docker
    containers:
      path: "/var/lib/docker/containers"
      ids:
        - "*"
    json.keys_under_root: true
    json.ignore_decoding_error: true
    tail_files: true

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  pipeline: request-logs
  # index: "filebeat-*"

setup.kibana.host: "kibana:5601"

setup.template.name: "filebeat"
# setup.template.fields: "fields.yml"
setup.template.overwrite: true
setup.template.append_fields:
  - name: err_msg
    type: keyword
  - name: err_stack
    type: text
  - name: err_data
    type: text
  - name: req.duration
    type: double
  - name: res.status
    type: integer
  - name: res.length
    type: integer

# set this to false, because it installs many visualization and dashboard templates (quite noisy)
setup.dashboards.enabled: false
setup.dashboards.index: "filebeat-*"
setup.dashboards.retry.enabled: true
setup.dashboards.retry.interval: 10

processors:
  - add_docker_metadata: ~
  - drop_fields:
      fields: ["input.type", "prospector.type", "docker.container.id", "docker.container.labels.com.docker.stack.namespace", "docker.container.labels.com.docker.swarm.node.id", "docker.container.labels.com.docker.swarm.service.id", "docker.container.labels.com.docker.swarm.service.name", "docker.container.labels.com.docker.swarm.task.id", "docker.container.labels.com.docker.swarm.task.name", "docker.container.labels.com.docker.swarm.task.value"]
  - drop_event:
      when:
        - regexp:
            docker.container.image: '^docker\.elastic\.co\/'

xpack.monitoring.enabled: true

logging.level: warning
logging.metrics.enabled: false

(Pier-Hugues Pellerin) #3

@caub That's a lot of requests indeed. What is happening is that the client detects the remote host is gone and tries to reconnect to another host, but does a DNS check first.

I wonder if configuring backoff.init with a bigger value would help in this case; that would certainly give the node more time to come back online.
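For reference, this is roughly what that would look like in filebeat.yml. backoff.init and backoff.max are the exponential-backoff settings of the elasticsearch output; the 10s/300s values below are purely illustrative:

```yaml
output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  # Wait longer before the first reconnection attempt after a failure...
  backoff.init: 10s
  # ...and cap the exponentially growing wait at a higher ceiling.
  backoff.max: 300s
```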

I presume that in the above scenario Filebeat tries to reconnect multiple times without success and generates a lot of DNS requests.


(Cyril Auburtin) #4

Thanks, here's a simple reproduction of the issue: https://github.com/caub/filebeat-bug (with a video)

I included backoff.init: 20s in filebeat.yml; that didn't change the result much, it just delayed the peak.

The DNS request rate keeps increasing until it reaches a very high maximum; stopping filebeat brings it back down to something reasonable.

In the last commit, adding dns: '' makes the issue disappear, so that's one workaround.
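For anyone hitting the same problem, that workaround looks roughly like this in the stack's compose file (a sketch only; see the linked repo's last commit for the actual file):

```yaml
services:
  filebeat:
    # ... image, volumes, etc. as in the repo ...
    # Empty DNS server list: the container stops falling back to the
    # host's resolver, which avoids the query flood.
    dns: ''
```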

I'd still be interested to know what causes this issue with filebeat's docker input, though. It's as if DNS failures make the log volume grow exponentially, even though, as you can see, I drop events coming from docker.container.image: '^docker\.elastic\.co\/' (i.e. filebeat itself).


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.