Hi all,
I am running Filebeat (6.0.0-beta2) on my Docker hosts to collect Docker logs (via the json-file driver) and ship them to a group of Logstash 5.5 instances for aggregation. I currently have several Logstash instances behind a load balancer (AWS ELB).
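(For context, the hosts use Docker's json-file logging driver, configured roughly like the daemon.json below; the log-opts values are illustrative, not my exact settings:)
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}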
I am seeing this error in all of my Filebeat logs:
2017/09/13 14:59:41.649837 metrics.go:39: INFO Non-zero metrics in the last 30s: beat.memstats.memory_alloc=5457424 beat.memstats.memory_total=5457424 libbeat.output.events.batches=3 libbeat.output.events.failed=6082 libbeat.output.events.total=6082 libbeat.output.read.errors=1 libbeat.output.write.bytes=311749 libbeat.pipeline.events.retry=8068
2017/09/13 14:59:49.346407 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:60592->10.178.74.87:10200: i/o timeout
2017/09/13 14:59:49.346483 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:60592->10.178.74.87:10200: i/o timeout
2017/09/13 14:59:49.379885 async.go:233: ERR Failed to publish events caused by: client is not connected
2017/09/13 14:59:50.380037 output.go:92: ERR Failed to publish events: client is not connected
2017/09/13 15:00:11.649853 metrics.go:39: INFO Non-zero metrics in the last 30s: beat.memstats.memory_alloc=5450728 beat.memstats.memory_total=5450728 libbeat.output.events.batches=3 libbeat.output.events.failed=6082 libbeat.output.events.total=6082 libbeat.output.read.errors=1 libbeat.output.write.bytes=311151 libbeat.pipeline.events.retry=8068
2017/09/13 15:00:20.416805 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:50926->10.178.57.136:10200: i/o timeout
2017/09/13 15:00:20.416914 async.go:233: ERR Failed to publish events caused by: read tcp 172.17.0.1:50926->10.178.57.136:10200: i/o timeout
2017/09/13 15:00:20.458262 async.go:233: ERR Failed to publish events caused by: client is not connected
I'm having trouble diagnosing this. At first I suspected Logstash was overloaded and applying back-pressure, but I don't see any indication of that on the Logstash side.
From Logstash's Node Stats API, here are a few things that I can see:
root@32f6f94bc186:/# curl -XGET 'localhost:9600/_node/stats/pipeline?pretty'
{
  "host" : "32f6f94bc186",
  "version" : "5.5.0",
  "http_address" : "127.0.0.1:9600",
  "id" : "e6eb8a73-0f61-41b1-8ea8-89a17992490c",
  "name" : "32f6f94bc186",
  "pipeline" : {
    "events" : {
      "duration_in_millis" : 14207432,
      "in" : 364783,
      "filtered" : 364767,
      "out" : 363788,
      "queue_push_duration_in_millis" : 70037190
    },
    "plugins" : {
      "inputs" : [ {
        "id" : "8227147f979f7be0e44e67dcb6605fbea593e389-1",
        "events" : {
          "out" : 364783,
          "queue_push_duration_in_millis" : 70037190
        },
        "current_connections" : 1421,
        "name" : "beats",
        "peak_connections" : 1421
      } ],
root@32f6f94bc186:/# curl -XGET 'localhost:9600/_node/stats/process?pretty'
{
  "host" : "32f6f94bc186",
  "version" : "5.5.0",
  "http_address" : "127.0.0.1:9600",
  "id" : "e6eb8a73-0f61-41b1-8ea8-89a17992490c",
  "name" : "32f6f94bc186",
  "process" : {
    "open_file_descriptors" : 2044,
    "peak_open_file_descriptors" : 2044,
    "max_file_descriptors" : 1048576,
    "mem" : {
      "total_virtual_in_bytes" : 7198326784
    },
    "cpu" : {
      "total_in_millis" : 604050,
      "percent" : 2,
      "load_average" : {
        "1m" : 0.88,
        "5m" : 1.47,
        "15m" : 1.68
      }
    }
  }
}
I have 28 Filebeat instances (28 Docker hosts), so why are there 1421 connections to this Logstash instance?
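(To cross-check that number against the OS, I can count established connections to the beats port directly on the Logstash node; the ss invocation below is just a sketch and may need tweaking inside the container image:)
# Count established TCP connections to the beats input on port 10200 (skip the header line).
ss -tn state established '( sport = :10200 )' | tail -n +2 | wc -l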
Here is my Filebeat configuration:
filebeat.prospectors:
- type: log
  paths:
    - '/var/lib/docker/containers/*/*.log'
  json.message_key: log
  json.keys_under_root: true
processors:
- add_docker_metadata: ~
output.logstash:
  enabled: true
  hosts: ["${LOGSTASH_HOST}:10200"]
  ttl: 60s
logging.to_files: false
max_retries: -1
I understand the "ttl" option was introduced to periodically close the persistent Beats->Logstash connections, which is helpful when you have an external load balancer in front of Logstash (as I do). However, I don't see anything in the Filebeat logs indicating that it is closing and re-opening connections every 60s.
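One thing I'm unsure about is whether ttl only takes effect when pipelining is disabled on the Logstash output. If that's the case, I assume the output section would need to look something like this (same placeholder host as above; the pipelining: 0 line is my assumption, not something I've confirmed):
output.logstash:
  enabled: true
  hosts: ["${LOGSTASH_HOST}:10200"]
  ttl: 60s
  # Assumption: ttl may only apply to the synchronous client, so disable pipelining.
  pipelining: 0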
Here is the input portion of my Logstash config:
input {
  beats {
    port => 10200
  }
}
This does not appear to be a network connectivity issue. I am able to successfully connect to TCP/10200 from the Filebeat instances to the Logstash ELB (and the ELB can reach the Logstash nodes on 10200).
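(For reference, a check along these lines from a Filebeat host succeeds; the ELB hostname below is a placeholder for my actual endpoint:)
# Placeholder ELB endpoint; substitute the real hostname.
nc -zv my-logstash-elb.example.com 10200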
Any ideas on what may be going on here?