Hi all,
I am running Filebeat (6.0.0-beta2) on my Docker hosts to collect Docker logs (via the json-file driver) and send them to a group of Logstash 5.5 instances for aggregation. The Logstash instances currently sit behind a load balancer (AWS ELB).
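For context, the hosts are on Docker's default json-file log driver; that's easy to confirm on each host with something like:
docker info --format '{{.LoggingDriver}}'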
I am seeing errors like these in all of my Filebeat logs:
2017/09/13 14:59:41.649837 metrics.go:39: INFO Non-zero metrics in the last 30s: beat.memstats.memory_alloc=5457424 beat.memstats.memory_total=5457424 libbeat.output.events.batches=3 libbeat.output.events.failed=6082 libbeat.output.events.total=6082 libbeat.output.read.errors=1 libbeat.output.write.bytes=311749 libbeat.pipeline.events.retry=8068
2017/09/13 14:59:49.346407 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:60592->10.178.74.87:10200: i/o timeout
2017/09/13 14:59:49.346483 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:60592->10.178.74.87:10200: i/o timeout
2017/09/13 14:59:49.379885 async.go:233: ERR Failed to publish events caused by: client is not connected
2017/09/13 14:59:50.380037 output.go:92: ERR Failed to publish events: client is not connected
2017/09/13 15:00:11.649853 metrics.go:39: INFO Non-zero metrics in the last 30s: beat.memstats.memory_alloc=5450728 beat.memstats.memory_total=5450728 libbeat.output.events.batches=3 libbeat.output.events.failed=6082 libbeat.output.events.total=6082 libbeat.output.read.errors=1 libbeat.output.write.bytes=311151 libbeat.pipeline.events.retry=8068
2017/09/13 15:00:20.416805 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:50926->10.178.57.136:10200: i/o timeout
2017/09/13 15:00:20.416914 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:50926->10.178.57.136:10200: i/o timeout
2017/09/13 15:00:20.458262 async.go:233: ERR Failed to publish events caused by: client is not connected
I'm having trouble diagnosing this. At first, I thought that Logstash was under too much pressure, but I don't see any indication of this on the Logstash side.
From Logstash's Node Stats API, here are a few things I can see:
root@32f6f94bc186:/# curl -XGET 'localhost:9600/_node/stats/pipeline?pretty'
{
  "host" : "32f6f94bc186",
  "version" : "5.5.0",
  "http_address" : "127.0.0.1:9600",
  "id" : "e6eb8a73-0f61-41b1-8ea8-89a17992490c",
  "name" : "32f6f94bc186",
  "pipeline" : {
    "events" : {
      "duration_in_millis" : 14207432,
      "in" : 364783,
      "filtered" : 364767,
      "out" : 363788,
      "queue_push_duration_in_millis" : 70037190
    },
    "plugins" : {
      "inputs" : [ {
        "id" : "8227147f979f7be0e44e67dcb6605fbea593e389-1",
        "events" : {
          "out" : 364783,
          "queue_push_duration_in_millis" : 70037190
        },
        "current_connections" : 1421,
        "name" : "beats",
        "peak_connections" : 1421
      } ],
root@32f6f94bc186:/# curl -XGET 'localhost:9600/_node/stats/process?pretty'
{
  "host" : "32f6f94bc186",
  "version" : "5.5.0",
  "http_address" : "127.0.0.1:9600",
  "id" : "e6eb8a73-0f61-41b1-8ea8-89a17992490c",
  "name" : "32f6f94bc186",
  "process" : {
    "open_file_descriptors" : 2044,
    "peak_open_file_descriptors" : 2044,
    "max_file_descriptors" : 1048576,
    "mem" : {
      "total_virtual_in_bytes" : 7198326784
    },
    "cpu" : {
      "total_in_millis" : 604050,
      "percent" : 2,
      "load_average" : {
        "1m" : 0.88,
        "5m" : 1.47,
        "15m" : 1.68
      }
    }
  }
I have 28 Filebeat instances (28 Docker hosts), so why are there 1421 connections to this Logstash instance?
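The 1421 figure is the current_connections / peak_connections reported by the beats input above; the live connections on the Logstash host could also be counted directly with something like:
ss -tan 'sport = :10200' | grep -c ESTAB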
Here is my Filebeat configuration:
filebeat.prospectors:
- type: log
  paths:
    - '/var/lib/docker/containers/*/*.log'
  json.message_key: log
  json.keys_under_root: true
  processors:
  - add_docker_metadata: ~
output.logstash:
  enabled: true
  hosts: ["${LOGSTASH_HOST}:10200"]
  ttl: 60s
  logging.to_files: false
  max_retries: -1
I understand the "ttl" option was introduced in order to kill the persistent beats->Logstash connections, which is helpful when you have an external load balancer in front of Logstash (as I do). However, I don't see anything in the Filebeat logs indicating that it is actually closing and re-opening connections every 60s.
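One way I could check this from a Filebeat host would be to watch the outbound connections to port 10200 and see whether the source ports actually change roughly once a minute, e.g.:
watch -n 10 "ss -tn 'dport = :10200'"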
Here is the input portion of my Logstash config:
input {
  beats {
    port => 10200
  }
}
This does not appear to be a network connectivity issue: I can successfully open TCP connections on port 10200 from the Filebeat hosts to the Logstash ELB (and the ELB can reach the Logstash nodes on 10200).
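That check was just a plain TCP connect along these lines (the ELB hostname here is a placeholder):
nc -vz my-logstash-elb.example.com 10200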
Any ideas on what may be going on here?