Filebeat -> Logstash connection reset by peer

Hi!

I have several Filebeat containers (one Filebeat container per host) in my infrastructure sending logs to three Logstash containers, each running on its own instance. Filebeat is configured to load balance between them, with publish_async: true.

The problem I'm facing is that I get these errors a lot:

10:45:41.176206 sync.go:85: ERR Failed to publish events caused by: read tcp 172.17.0.2:59030->52.50.63.141:5000: read: connection reset by peer
10:45:41.176247 sync_worker.go:167: INFO Error publishing events (retrying): read tcp 172.17.0.2:59030->52.50.63.141:5000: read: connection reset by peer

10:45:41.172658 sync.go:85: ERR Failed to publish events caused by: write tcp 172.17.0.2:60410->52.213.93.139:5000: write: connection reset by peer
10:45:41.172724 sync_worker.go:167: INFO Error publishing events (retrying): write tcp 172.17.0.2:60410->52.213.93.139:5000: write: connection reset by peer

I don't seem to lose any logs, since Filebeat re-establishes the connection after a while.

However, if I disable load balancing on Filebeat, the problem seems to go away. But I would like to use load balancing.

I'm running Filebeat 5.0.1 with this config:

filebeat.prospectors:
- input_type: log
  document_type: syslogMessages
  scan_frequency: 5s
  close_inactive: 1m
  backoff_factor: 1
  backoff: 1s
  paths:
    - /host/var/log/messages

- input_type: log
  document_type: syslogSecure
  scan_frequency: 5s
  backoff_factor: 1
  close_inactive: 1m
  backoff: 1s
  paths:
    - /host/var/log/secure

- input_type: log
  document_type: ecsAgent
  scan_frequency: 5s
  backoff_factor: 1
  close_inactive: 10s
  backoff: 1s
  paths:
    - /host/var/log/ecs/ecs-agent.log.*

- input_type: log
  document_type: docker
  scan_frequency: 1s
  close_inactive: 10m
  backoff_factor: 1
  backoff: 1s
  json.message_key: log
  json.keys_under_root: true
  json.add_error_key: true
  json.overwrite_keys: true
  paths:
    - /host/var/lib/docker/containers/*/*.log
  multiline.pattern: '^[[:space:]]+|^Caused by:'
  multiline.negate: false
  multiline.match: after
  multiline.timeout: 1s

#================================ General =====================================
filebeat.publish_async: true
filebeat.idle_timeout: 1s
filebeat.shutdown_timeout: 5s

fields_under_root: true
fields:
  accountId: ${ACCOUNTID}
  instanceId: ${INSTANCEID}
  instanceName: ${INSTANCENAME}
  region: ${REGION}
  az: ${AZ}
  environment: ${ENV}

logging.metrics.enabled: true
logging.metrics.period: 60s
logging.level: info

#================================ Outputs =====================================
#----------------------------- Logstash output --------------------------------
output.logstash:
  hosts: ["indexer01:5000", "indexer02:5000", "indexer03:5000"]
  compression_level: 1
  worker: 2
  loadbalance: true
  ssl.certificate_authorities: ["/host/opt/filebeat/logstash.pem"]
  max_retries: -1

Logstash is version 5.1.2 with this input/output conf:

input {
  beats {
    port => "5000"
    ssl => "true"
    ssl_certificate => "/host/opt/logstash/logstash.pem"
    ssl_key => "/host/opt/logstash/logstash.key"
    client_inactivity_timeout => "900"
  }
}

output {
  ## If type is ecs, send to ES and PT. Sends to PT via rsyslog on the host to get SSL.
  ## Uses ECS cluster name + container name in PT.
  if [type] == "ecs" {
    syslog {
      facility => "local0"
      severity => "notice"
      host => "172.17.0.1"
      port => "514"
      appname => "%{ecsContainerName}"
      sourcehost => "%{ecsCluster}"
      protocol => "tcp"
    }
    elasticsearch {
      hosts => ["https://${ESENDPOINT}:443"]
      ssl => "true"
      manage_template => false
      index => "my-ecs-logs-%{+YYYY.MM.dd}"
    }
  }
  # ... several more outputs to the same ES cluster, but to other indices.
}

The beats input plugin is version 3.1.12.

The Logstash containers run on three c4.large instances in AWS with 16 workers each, sending the logs to AWS ES and Papertrail (via rsyslog on the host to get SSL). I get no errors in my Logstash logs; however, I have not run them with debug logging. There are no iptables rules or similar configured on the Logstash hosts/containers.
You can find debug logs from Filebeat here: http://pastebin.com/NSNcwtS8

I don't see how this problem is related to publish_async: true. I can't even find an issue in the Filebeat log output. Maybe some Logstash debug logs would be helpful.

This setting does not really affect how the outputs work. Either way, the outputs send a batch of events and wait for an ACK before transmitting the next batch. With publish_async enabled, additional batches are prepared and held in the internal publisher queue for the outputs to handle. If publish_async is disabled, the only difference is that the next batch is prepared only after the active batch has been ACKed by Logstash.

Does the problem persist if you disable publish_async and use batch splitting? Setting spool_size: 12288 and output.logstash.bulk_max_size: 2048 will split a spooler batch into 6 sub-batches and forward those in a load-balanced manner. Only after all sub-batches have been ACKed will the pipeline forward the next batch (this is a kind of lock-step load balancing); see the sketch below.
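A minimal filebeat.yml sketch of that batch-splitting setup, reusing the hosts from the config above (the values are just the suggested example, not a definitive recommendation):

filebeat.publish_async: false       # disabled for this test
filebeat.spool_size: 12288          # spooler batch size

output.logstash:
  hosts: ["indexer01:5000", "indexer02:5000", "indexer03:5000"]
  loadbalance: true
  bulk_max_size: 2048               # 12288 / 2048 = 6 sub-batches per spooler flush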

I wonder if the issue is due to Logstash or AWS closing your connections on purpose. I noted the window size dropping to 1, which is an indicator that the bad condition has been active for quite some time.

I don't think this is related to publish_async: true either... I think this is a problem with load balancing in general, I just don't get what, though...

I tried splitting the batch with your suggested config, but I still have the same problem: connection reset.

I will try to get this up and running in a test environment; I get too many logs to be able to turn on debug logging on the Logstash indexers.

EDIT:
I forgot to mention in my first post that we're sending the logs over the internet. The Logstash indexers are located in a different VPC and/or AWS account than the Filebeats.

Is Filebeat running on Windows? There was a report once about the Windows internal firewall rules treating Filebeat sending too much data in parallel as a security issue and closing down the connection.

Without Logstash logs I can't really tell whether it's Logstash prematurely closing connections or a network problem. By the way, have you tried increasing client_inactivity_timeout in Logstash (see the sketch below)? I'm not sure if this is true, but it could be that Netty always generates some kind of inactivity signal at set intervals (if no data has been read from the socket) and the signal is only ignored while Logstash is actively processing a batch (this would mean the timeout is somewhat prone to races).
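For illustration, the beats input from above with a longer timeout; the 86400 (24h) value is just an example, tune it to your environment:

input {
  beats {
    port => "5000"
    ssl => "true"
    ssl_certificate => "/host/opt/logstash/logstash.pem"
    ssl_key => "/host/opt/logstash/logstash.key"
    client_inactivity_timeout => "86400"   # raised from 900 (15 min) to 24h
  }
}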

If you can't collect debug logs from Logstash, there is another experiment you can run. Run tcpdump on every single host, monitoring only the Logstash port and collecting only packets with the SYN, RST, or FIN flags set. By comparing the packets in the pcaps for the same time window, we can maybe figure out who is closing the connection; see the example command below.
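For reference, a capture along these lines should do it (port 5000 taken from the config above; pick the right interface for your hosts):

tcpdump -i any -w beats-resets.pcap 'tcp port 5000 and (tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0)'

Correlating the timestamps of the RST/FIN packets across the pcaps should show which side closes the connection first.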

Hi!

No, it's all Linux. I increased client_inactivity_timeout to 86400 (24h) and it looks way better. I will get back with more tests and results.

