We seem to be losing a lot of logs from just a couple of the servers we collect from and ship via Filebeat. This is our basic configuration:
filebeat:
  prospectors: []
  registry_file: "/var/lib/filebeat/registry"
  config_dir: "/etc/filebeat/conf.d"
output:
  logstash:
    enabled: true
    hosts:
      - 10.252.250.30:5044
      - 10.252.250.53:5044
      - 10.252.250.59:5044
      - 10.252.250.60:5044
    loadbalance: true
logging:
  level: info
  to_syslog: true
and our prospectors file looks like:
filebeat:
  prospectors:
    - paths:
        - "/opt/bro/logs/current/conn.log"
      document_type: bro
      fields:
        type: conn
    - paths:
        - "/opt/bro/logs/current/files.log"
      document_type: bro
      fields:
        type: files
What happens is that when we start Filebeat, it collects logs and sends them through Logstash to Kibana. We see logs for about a minute, then they stop, and we begin to receive errors like these:
Sep 21 21:06:05 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:06:05 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:06:05 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:07:15 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:07:15 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:07:15 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:08:25 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:08:25 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:08:25 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:09:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:09:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:09:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:10:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:126: Connecting error publishing events (retrying): dial tcp IPADDRESS:5044: getsockopt: connection refused
After that point, we only see sparse, intermittent logs, even though there are many more on the server.
Our Logstash input for beats looks like this:
beats {
  port => 5044
  congestion_threshold => 25
}
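One change we are considering on the Logstash side is raising congestion_threshold further, since the default is only a few seconds as far as we can tell and ours is already bumped to 25. The value below is purely a guess on our part, not something we have validated:

beats {
  port => 5044
  # hypothetical value; we have no idea yet what a sensible threshold is for our volume
  congestion_threshold => 300
}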
We collect a huge volume of logs via Filebeat (syslogs from hundreds of servers) and send them through these exact same Logstash servers, and all of them seem to make it into Elasticsearch just fine. However, on just a couple of servers where we collect a particularly high volume of logs, we hit this issue: we get some logs, then they stop flowing, and after that we only see random bits of logs intermittently. Have you seen an issue like this before? How can we configure Filebeat to handle particularly high loads?
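For context, this is the kind of Filebeat-side tuning we have been looking at but have not applied yet. The option names come from our reading of the Filebeat 1.x reference for the logstash output, and the values are guesses rather than anything we have tested:

filebeat:
  # number of events spooled before flushing to the output (guessed value)
  spool_size: 4096
output:
  logstash:
    hosts:
      - 10.252.250.30:5044
      - 10.252.250.53:5044
      - 10.252.250.59:5044
      - 10.252.250.60:5044
    loadbalance: true
    # cap on events per batch sent to Logstash (guessed value)
    bulk_max_size: 1024
    # seconds to wait for Logstash before a send counts as failed (guessed value)
    timeout: 60

If anyone can confirm which of these actually matter for high-volume prospectors, that would help a lot.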
Another note: during troubleshooting, we dropped the number of Logstash instances Filebeat forwards to from four down to a single instance. When we do that, Filebeat no longer seems to MISS any logs, but it chugs through the backlog incredibly slowly. As soon as we add more servers back into the load-balancing scheme, it becomes less reliable again.
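For clarity, the cut-down output we used for that test looked roughly like this; with a single host, loadbalance is effectively a no-op:

output:
  logstash:
    enabled: true
    hosts:
      - 10.252.250.30:5044
    loadbalance: true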
Any help appreciated!!!