Filebeat to Logstash I/O Timeouts through AWS ELB

Hi all,

I am running Filebeat (6.0.0-beta2) on my Docker hosts to collect Docker logs (written by the json-file driver) and send them to a group of Logstash 5.5 instances for aggregation. I currently have several Logstash instances behind a load balancer (AWS ELB).

I am seeing this error in all of my Filebeat logs:

2017/09/13 14:59:41.649837 metrics.go:39: INFO Non-zero metrics in the last 30s: beat.memstats.memory_alloc=5457424 beat.memstats.memory_total=5457424 libbeat.output.events.batches=3 libbeat.output.events.failed=6082 libbeat.output.events.total=6082 libbeat.output.read.errors=1 libbeat.output.write.bytes=311749 libbeat.pipeline.events.retry=8068
2017/09/13 14:59:49.346407 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:60592->10.178.74.87:10200: i/o timeout
2017/09/13 14:59:49.346483 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:60592->10.178.74.87:10200: i/o timeout
2017/09/13 14:59:49.379885 async.go:233: ERR Failed to publish events caused by: client is not connected
2017/09/13 14:59:50.380037 output.go:92: ERR Failed to publish events: client is not connected
2017/09/13 15:00:11.649853 metrics.go:39: INFO Non-zero metrics in the last 30s: beat.memstats.memory_alloc=5450728 beat.memstats.memory_total=5450728 libbeat.output.events.batches=3 libbeat.output.events.failed=6082 libbeat.output.events.total=6082 libbeat.output.read.errors=1 libbeat.output.write.bytes=311151 libbeat.pipeline.events.retry=8068
2017/09/13 15:00:20.416805 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:50926->10.178.57.136:10200: i/o timeout
2017/09/13 15:00:20.416914 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:50926->10.178.57.136:10200: i/o timeout
2017/09/13 15:00:20.458262 async.go:233: ERR Failed to publish events caused by: client is not connected

I'm having trouble diagnosing this. At first, I thought that Logstash was under too much pressure, but I don't see any indication of this on the Logstash side.

From Logstash's Node Stats API, here are a few things I can see:

root@32f6f94bc186:/# curl -XGET 'localhost:9600/_node/stats/pipeline?pretty'
{
  "host" : "32f6f94bc186",
  "version" : "5.5.0",
  "http_address" : "127.0.0.1:9600",
  "id" : "e6eb8a73-0f61-41b1-8ea8-89a17992490c",
  "name" : "32f6f94bc186",
  "pipeline" : {
    "events" : {
      "duration_in_millis" : 14207432,
      "in" : 364783,
      "filtered" : 364767,
      "out" : 363788,
      "queue_push_duration_in_millis" : 70037190
    },
    "plugins" : {
      "inputs" : [ {
        "id" : "8227147f979f7be0e44e67dcb6605fbea593e389-1",
        "events" : {
          "out" : 364783,
          "queue_push_duration_in_millis" : 70037190
        },
        "current_connections" : 1421,
        "name" : "beats",
        "peak_connections" : 1421
      } ],

root@32f6f94bc186:/# curl -XGET 'localhost:9600/_node/stats/process?pretty'
{
  "host" : "32f6f94bc186",
  "version" : "5.5.0",
  "http_address" : "127.0.0.1:9600",
  "id" : "e6eb8a73-0f61-41b1-8ea8-89a17992490c",
  "name" : "32f6f94bc186",
  "process" : {
    "open_file_descriptors" : 2044,
    "peak_open_file_descriptors" : 2044,
    "max_file_descriptors" : 1048576,
    "mem" : {
      "total_virtual_in_bytes" : 7198326784
    },
    "cpu" : {
      "total_in_millis" : 604050,
      "percent" : 2,
      "load_average" : {
        "1m" : 0.88,
        "5m" : 1.47,
        "15m" : 1.68
      }
    }
  }

I have 28 Filebeat instances (28 Docker hosts), so why are there 1421 connections to this Logstash instance?

Here is my Filebeat configuration:

filebeat.prospectors:
- type: log
  paths:
   - '/var/lib/docker/containers/*/*.log'
  json.message_key: log
  json.keys_under_root: true
  processors:
  - add_docker_metadata: ~

output.logstash:
  enabled: true
  hosts: ["${LOGSTASH_HOST}:10200"]
  ttl: 60s
  logging.to_files: false
  max_retries: -1

I understand the "ttl" option was introduced to kill persistent Beats->Logstash connections, which is helpful when you are using external load balancers in front of Logstash (like I am). However, I don't see Filebeat logging anything to indicate that it is closing connections every 60s.

Here is the input portion of my Logstash config:

input {
  beats {
    port => 10200
  }
}

This does not appear to be a network connectivity issue. I am able to successfully connect to TCP/10200 from the Filebeat instances to the Logstash ELB (and the ELB can reach the Logstash nodes on 10200).

Any ideas on what may be going on here?

Where are the connections coming from? Running sudo netstat -anp | grep 10200 should give you some more info to answer that.
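For example, a quick way to break those down (a sketch, assuming the usual Linux netstat column layout; run it on the Logstash host) is to group established connections by remote address:

# count established connections on port 10200, grouped by remote IP
sudo netstat -anp | grep ':10200' | grep ESTABLISHED \
  | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn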

If you turn on debug logging, you should see log messages when it closes the connection and reconnects.

logging.level: debug
logging.selectors: [logstash]

I tried adding this under both the filebeat.prospectors and output.logstash config sections (and as a separate logging section), but I still don't see anything indicating that the connection is being closed and re-established every 60 seconds.

It's part of the logging configuration section and should be added to the config file without any indentation. https://github.com/elastic/beats/blob/v6.0.0-beta2/filebeat/filebeat.reference.yml#L1170-L1175
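For example, the logging keys sit at the same level as the filebeat.prospectors and output.logstash sections (a trimmed sketch of filebeat.yml):

filebeat.prospectors:
- type: log
  paths:
    - '/var/lib/docker/containers/*/*.log'

output.logstash:
  hosts: ["${LOGSTASH_HOST}:10200"]
  ttl: 60s

# top-level logging section, no indentation
logging.level: debug
logging.selectors: ["logstash"]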

Here are the logger statements for when it connects and closes the connection. https://github.com/elastic/beats/blob/v6.0.0-beta2/libbeat/outputs/logstash/sync.go#L54-L73

The TTL is only checked when events are about to be published. So if it is idle it won't be closing and re-opening connections, but when new events are ready it will check to see if the connection should be closed and re-opened. https://github.com/elastic/beats/blob/v6.0.0-beta2/libbeat/outputs/logstash/sync.go#L94-L106

Ok - thanks for the explanation and links!

The timeout occurs while waiting for the ACK/heartbeat signal from logstash. The default timeout in filebeat is 60 seconds. While logstash is actively processing a batch of events, it sends a heartbeat signal every 5 seconds.
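If you want to give logstash more headroom before filebeat declares an i/o timeout, you can raise the read timeout on the output (a sketch only; the 120s value is just an example to tune for your environment, and it assumes the output.logstash timeout option):

output.logstash:
  hosts: ["${LOGSTASH_HOST}:10200"]
  ttl: 60s
  # time filebeat waits for ACK/keep-alive signals from logstash before timing out
  timeout: 120s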

Reasons this signal is not received by filebeat can be either the network or contention in logstash (induced by additional back-pressure from the outputs). The beats input in logstash uses netty. Netty multiplexes multiple clients onto N worker threads. I'm not sure whether the number of worker threads is configurable; the default is something like 2 * the number of CPUs, I think. If the logstash pipeline is stalled (e.g. an expensive grok or an unresponsive output), the worker threads will have to wait. If the number of connections exceeds the number of available threads, the remaining connections will not be served (which might create a timeout in filebeat).

Upon failure filebeat reconnects. It seems like logstash still has the file descriptors for the old connections open, as no error has been read from those FDs yet. These clients/open FDs are not served. The input plugin has a setting to close inactive connections (default 30s).
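For reference, a sketch of what that would look like in the beats input (assuming the client_inactivity_timeout option; tune the value for your environment):

input {
  beats {
    port => 10200
    # close client connections that have been idle this many seconds
    client_inactivity_timeout => 30
  }
}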


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.