Missing a lot of logs sent via Filebeat

We seem to be losing a lot of the logs that we collect and send via Filebeat from just a couple of servers. This is our basic configuration:

filebeat:
  prospectors: []
  registry_file: "/var/lib/filebeat/registry"
  config_dir: "/etc/filebeat/conf.d"
output:
  logstash:
    enabled: true
    hosts:
    - 10.252.250.30:5044
    - 10.252.250.53:5044
    - 10.252.250.59:5044
    - 10.252.250.60:5044
    loadbalance: true

logging:
  level: info
  to_syslog: true

and our prospectors file looks like:

filebeat:
  prospectors:
  - paths:
    - "/opt/bro/logs/current/conn.log"
    document_type: bro
    fields:
      type: conn
  - paths:
    - "/opt/bro/logs/current/files.log"
    document_type: bro
    fields:
      type: files
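
For reference, the document_type and custom fields let us tell the two log types apart on the Logstash side. The snippet below is only an illustrative sketch, not our actual filter config:

filter {
  # Illustration only: branch on the Filebeat document_type and fields.type
  if [type] == "bro" and [fields][type] == "conn" {
    mutate { add_tag => ["bro_conn"] }
  } else if [type] == "bro" and [fields][type] == "files" {
    mutate { add_tag => ["bro_files"] }
  }
}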

What happens is: when we start Filebeat, it collects logs and sends them through Logstash to Kibana, and we see logs for about a minute; then they stop, and we begin to receive errors like these:

Sep 21 21:06:05 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:06:05 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:06:05 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:07:15 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:07:15 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:07:15 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:08:25 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:08:25 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:08:25 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:09:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:76: Error publishing events (retrying): EOF
Sep 21 21:09:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:152: send fail
Sep 21 21:09:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:159: backoff retry: 1m0s
Sep 21 21:10:35 nat-gateway-manager-uscen-a-c001-n001 /usr/bin/filebeat[21443]: single.go:126: Connecting error publishing events (retrying): dial tcp IPADDRESS:5044: getsockopt: connection refused

After that point, we only see sparse, intermittent logs, even though there are many more on the server.

Our logstash input for beats looks like this:

beats {
    port => 5044
    congestion_threshold => 25
}
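
As far as we understand, congestion_threshold is a number of seconds, so raising it only delays the circuit breaker rather than fixing whatever is slowing the pipeline down. For example (value purely illustrative):

beats {
  port => 5044
  # seconds the input tolerates a blocked pipeline before tripping the
  # circuit breaker and closing connections
  congestion_threshold => 60
}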

We collect a ton of logs via Filebeat (syslogs from hundreds of servers) and send them through the exact same Logstash servers, and all of those seem to make it into Elasticsearch just fine. However, on just a couple of servers where we collect a high volume of logs, we hit this issue: we get some logs, then they stop flowing, and afterwards we only see random bits of logs intermittently. Have you seen an issue like this before? How can we configure Filebeat to handle particularly high loads?
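
For anyone suggesting settings: the knobs we are aware of on the Filebeat side are spool_size, idle_timeout, and the Logstash output's bulk_max_size. A sketch of the kind of tuning we mean, with purely illustrative values (not what we actually run):

filebeat:
  spool_size: 4096       # events buffered before the spooler flushes
  idle_timeout: 5s       # flush the spooler even if spool_size is not reached
output:
  logstash:
    hosts:
    - 10.252.250.30:5044
    loadbalance: true
    bulk_max_size: 1024  # maximum events per batch sent to Logstash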

Another note: during troubleshooting, we dropped the number of Logstash instances that Filebeat forwards logs to from four to a single instance. When we do that, it no longer seems to MISS any logs, but it chugs through the backlog incredibly slowly. If we add more servers to the load-balancing scheme, it becomes less reliable.

Any help appreciated!!!

Which versions of Filebeat and Logstash are you using? Do you have any logs from the Logstash side?

@steffens Can you have a look here?

Bump! I am facing the same problem. I changed the congestion_threshold too and also reduced the number of logging machines (thus drastically reducing the throughput)... still the same problem.

@il.bert Could you also share the Filebeat version, the Logstash version, and both configs? In addition, it would be good to know which version of the beats input plugin is used.

For LS I used both 2.4 and 5.0.
FB is 5.0 alpha6.

My FB config is the following.

For each file I have this path config (see the prospector sketch after the full config):

  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m


#========================= Filebeat global options ============================

filebeat.spool_size: 10000
filebeat.idle_timeout: 10s

#----------------------------- Logstash output --------------------------------
output.logstash:
  hosts: ["10.246.85.242:5044", "10.246.85.243:5044"]
  template.name: "filebeat"
  loadbalance: true
  bulk_max_size: 1000  
  template.path: "filebeat.template.json"
  template.overwrite: false
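
For context, each of the per-file blocks above sits inside a prospector entry, roughly like this sketch (the path is just an example):

filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/example/*.log   # example path only
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m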

In LS I tried LOTS of configurations: 2, 3, and 4 instances of LS with different settings.
The one that got me the best results is the following. Other configurations only let me push through a much smaller throughput (about 10% of the load); with this one I reach almost 75%.

path.data: /var/lib/logstash
pipeline.workers: 24
pipeline.output.workers: 1
pipeline.batch.size: 15000
path.config: /etc/logstash/conf.d
config.reload.automatic: true
config.reload.interval: 30
log.level: verbose
path.log: /var/log/logstash/logstash.log

with

-Xms12g
-Xmx12g

I have also opened "FileBeat EOF Error" and "FileBeat file is falling under ignore older" for other errors I am facing.

I hope you can help ASAP, as my POC for my manager should be ready by next week.

I am using:

filebeat: 1.2.3
beats-input-plugin: 2.0.3, later tested on 2.2.9 with no fix.
logstash: 2.1.0

I don't see any Logstash errors. I believe that before we changed congestion_threshold => 25 in logstash-input-beats, we had some warnings like:

{:timestamp=>"2016-09-21T21:07:15.641000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}

It seems like this thread could be related: FileBeat EOF Error.

@il.bert As you are on both threads, perhaps you can share some insights?

Haha whoops, I posted my conclusion to the wrong thread. Here it is again:

Hey guys, I found that the root of our issue was actually a misconfigured filter on the Logstash end that was clogging the filter worker queue. I guess that eventually manifested itself in Filebeat's logs.
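
To give an idea of what to look for: on a high-volume input, an expensive filter (for example an unanchored, greedy grok that rarely matches) can pin the filter workers, and the resulting backpressure eventually shows up as those EOF errors on the Filebeat side. Purely as an illustration, not our actual filter:

filter {
  # Illustration only: GREEDYDATA on both sides forces heavy backtracking
  # on every event, which stalls the filter workers under load.
  grok {
    match => { "message" => "%{GREEDYDATA:prefix} %{IP:client} %{GREEDYDATA:rest}" }
  }
}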

For me there seems to be a different problem!

I posted a workaround in FileBeat EOF Error, but it is not a solution.
If you could read it, any help would be really appreciated!

This topic was automatically closed after 21 days. New replies are no longer allowed.