Filebeat: Failed to connect to backoff async dial tcp connect connection refused

I've gone through other posts with this same issue and have come up empty-handed. I've set up ELK rigs before, and this is the first time I've encountered this problem.

General Info:
*New cluster with 4 nodes: 1 for Kibana, 3 for Elasticsearch and Logstash.
*v7.2.1
*Each node is a 32 GB, 4 vCPU machine with SSDs for both the OS and data volumes.
*CentOS 7.7.1908
*All VMs are on the same subnet in an Azure VNET. There are no firewall rules between any nodes. iptables and SELinux are disabled.
*TLS is not configured anywhere.

The cluster is in dev and I'm only ingesting a small volume of logs, around 2 GB per day, mostly syslog and audit/secure logs.

Logs are appearing in Elasticsearch, and I can interact with them in Kibana. Yet I keep getting the following errors from Filebeat.

Jan 26 02:38:17 elk-nodes-0.int filebeat[25793]: 2020-01-26T02:38:17.861Z        ERROR        logstash/async.go:256        Failed to publish events caused by: read tcp 10.1.3.4:45488->10.1.3.4:5044: read: connection reset by peer
Jan 26 02:38:17 elk-nodes-0.int filebeat[25793]: 2020-01-26T02:38:17.862Z        ERROR        logstash/async.go:256        Failed to publish events caused by: client is not connected

I have tried playing with various config settings in both Filebeat and Logstash, and so far I haven't had any luck. I've also tried removing 2 of the 3 nodes, so Filebeat is just talking to Logstash on the same machine. I can telnet to the appropriate ports without issue, so I'm really baffled as to what I'm doing wrong. And given that I do see entries in the Elasticsearch indexes, could this all be a false positive? This rig MUST pass an audit, so I cannot lose any logs; I would rather figure this out than just assume it's all OK.

Thanks in advance for any help on this!

Here is my filebeat.yml. The system module is also enabled for the syslog/secure log files:

filebeat.inputs:
- type: log
  enabled: true
  timeout: 300s
  paths:
    - /var/log/fail2ban.log
    - /var/log/clamav*.log

- type: log
  enabled: true
  timeout: 300s
  paths:
    - /var/ossec/logs/alerts/alerts.log
  json.keys_under_root: true
  fields: {log_type: osseclogs}

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: true

setup.template.settings:
  index.number_of_shards: 1

setup.kibana:
  host: "10.1.2.8:5601"

output.logstash:
  hosts: ["10.1.3.4:5044"]
  loadbalance: false
  worker: 1
  bulk_max_size: 1024
  slow_start: true
  backoff.init: 5s

queue.mem:
  events: 4096
  flush.min_events: 512
  flush.timeout: 5s

fields: {env: "test", role:"elk-node"}
tags: ["test"]

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~

filebeat.shutdown_timeout: 10s

logging.level: warning
logging.to_files: true
logging.to_syslog: false
logging.files:
  path: /var/log/filebeat
  name: filebeat.log
  keepfiles: 7
  rotateeverybytes: 20971520
  permissions: 0644
  rotateonstartup: true

Hi!

It looks like a connection issue (from Filebeat to Logstash) that occurs from time to time in your setup. This is why you still see data in Elasticsearch. Having said this, could you confirm that after an error has occurred, you see events again? Does Filebeat crash? I guess not.

If Filebeat continues shipping events after those errors then you will not lose any data since Filebeat will send them after the connection is established again.
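For reference, the reconnect behaviour described above is governed by a handful of settings on the Logstash output. The fragment below is only a sketch: the host is the one from the original post, and the values are illustrative (they are what I understand the documented defaults to be), not a recommended tuning.

```yaml
output.logstash:
  hosts: ["10.1.3.4:5044"]
  # Seconds to wait for a response from Logstash before timing out.
  timeout: 30
  # Wait this long before the first reconnect attempt after a network error.
  backoff.init: 1s
  # The wait grows exponentially on each failed attempt, up to this cap.
  backoff.max: 60s
```

As far as I know, Filebeat ignores max_retries on this output and keeps retrying until Logstash acknowledges the events, which is why an intermittent reset shows up as an error in the log rather than as missing events.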

Thanks!

First, thank you for the reply!

My comments:

  1. "It looks like a connection issue (from Filebeat to Logstash)".
    Reply: Even on a one-node cluster? I modified the config to take all the other nodes out, and the amount of logs I'm ingesting is only about 1 MB/hour. The VM has 32 GB of RAM and 4 vCPUs that are mostly idle. I have approximately 150 VMs running in this cloud environment and have never noticed any other networking issue this severe before. I'm not saying it's not possible, it's just highly suspect to me.

  2. "you will see events again? Does Filebeat crash? "
    Reply: Correct, events keep coming in after the error, and no, the process does not crash.

So that's it then? I just ignore this, chalk it up to ghosts, and move on? I'm willing to try anything you can recommend to get to the bottom of it. Any debug output, strace, or whatever else might be helpful to you or me. Please let me know.
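One low-risk thing to try in the meantime is turning up Filebeat's own logging for a short capture. This is a sketch that would replace the logging.level line in the config above; the "*" selector enables debug output from every component and is very noisy, so it is meant as a temporary setting only.

```yaml
# Temporary: verbose logging while reproducing the connection resets.
logging.level: debug
# "*" enables debug output for all components.
logging.selectors: ["*"]
```

Running Filebeat in the foreground with filebeat -e -d "*" should give the same output on the console without editing the file.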

I'm not completely out of the woods yet, but here was a major part of my problem.
This line in filebeat.yml:
fields: {env: "corp", role:"my_role"}
needs to be this:
fields: {env: "corp", role: "my_role"}

Can you see the difference? Yup, there's a space after "role:" in the second one. Both pass the Filebeat config test and yamllint, yet the first one, without the space, actually causes Filebeat to barf as it processes the YAML config file. Filebeat should fail to start when this happens! Instead it just kind of stops at the barf spot and runs with a partial config.
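One way to sidestep this class of problem is to write the mapping in block style, where each key sits on its own line and a missing space after the colon is much easier to spot. This is a sketch using the same example values as above. It does not describe how Filebeat's parser handles the flow-style form; my assumption is only that a colon with no following space inside {...} is a case where YAML tools do not all agree.

```yaml
# Block-style equivalent of: fields: {env: "corp", role: "my_role"}
fields:
  env: "corp"
  role: "my_role"
```

To check what Filebeat actually loaded, filebeat export config prints the effective configuration, so both keys should be visible under fields there.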

I lost days of time trying to figure this out. It's hard to take ELK seriously right now.
