Filebeat: If tcp write or read error then filebeat stops harvesting files until restart


(Alexandre Klein) #1

Hello,
I am currently using the ELK stack in v5.5 and I have quite a big issue:
every time I get a TCP write or read error, I have to restart Filebeat because it stops sending messages.

2017-07-25T11:14:46+02:00 ERR Failed to publish events caused by: write tcp 163.172.15.176:53430->163.172.99.57:5000: write: connection reset by peer
2017-07-25T11:14:46+02:00 ERR Failed to publish events caused by: write tcp 163.172.15.176:53428->163.172.99.57:5000: write: connection reset by peer
2017-07-25T11:14:47+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=9 libbeat.logstash.publish.read_bytes=350 libbeat.logstash.publish.write_bytes=5351704 libbeat.logstash.publish.write_errors=4 libbeat.logstash.published_and_acked_events=30181 libbeat.logstash.published_but_not_acked_events=12220 libbeat.publisher.published_events=30901 publish.events=30181 registrar.states.update=30181 registrar.writes=5
2017-07-25T11:15:17+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=4408 libbeat.logstash.publish.write_bytes=1727247 libbeat.logstash.published_and_acked_events=24497 libbeat.publisher.published_events=13739 publish.events=12277 registrar.states.update=12277 registrar.writes=2
2017-07-25T11:15:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:16:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:16:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:17:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:17:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:18:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:18:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:19:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:19:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:20:17+02:00 INFO No non-zero metrics in the last 30s
When I restart Filebeat, it pushes the missing messages and the new ones.

Here is my filebeat conf:

filebeat:
  name: "host7"
  spool_size: 16384
  prospectors:
  -
    paths:
      - /var/log/varnish/varnish.log
    input_type: log
    fields_under_root: true
    fields:
      tags: ['json', 'varnish']
      platform: boxes
    document_type: varnish-logs
    close_inactive: 5m

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  pipelining: 5
  worker: 2
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]

I have to send the logs to a distant datacenter. My Logstash usually gets 12k messages/s, and I have the same problem on 5 different platforms (especially the ones that don't send a lot of messages).

I started using Filebeat extensively (and started to load balance) when I migrated from 5.4 to 5.5, so I am not sure whether the problem appeared with the 5.5 migration or whether it would also occur in 5.4.

Thanks!


(Steffen Siering) #2

Can you run filebeat with debug logs enabled?

logging.level: debug
logging.selectors: ["output", "logstash"]

On reconnect, the logstash output will print close-connection and connect messages, plus debug messages on the number of events sent. The 'output' selector might add messages like: add non-published events back into pipeline and async bulk publish success.

Please note, upon failure the client uses exponential backoff (but only up to 1 minute).

When you kill filebeat with kill -ABRT <pid>, it will print a stack trace. Alternatively, you can start filebeat with -httpprof :6060 and get a stack trace of all goroutines via curl http://localhost:6060/debug/pprof/goroutine.

Having multiple stack traces plus debug logs can be helpful for identifying if/where the outputs might actually hang.

The spool_size is only twice bulk_max_size. Why have 2 workers with pipelining set to 5? Does the problem still occur if you set pipelining to 0?
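
For that test, the output section could look roughly like this (a sketch reusing the hosts and ssl settings from the config posted above, with only pipelining changed):

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  pipelining: 0       # no pipelining: wait for the ACK of each batch before sending the next one
  worker: 2
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]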


(Alexandre Klein) #3

My Logstash servers are not in the same datacenter, so ... while reading about the pipelining configuration option, I understood that setting pipelining (I put a random number for the test) would allow pushing further batches without waiting for an ACK.

For bulk_max_size and spool_size, it seems I misunderstood the lock-step behaviour. Workers are not needed (events go to N hosts in lock-step).
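
So the output section would probably end up like this (just a sketch, keeping everything else from my config above and dropping the worker and pipelining settings):

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]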

I added the debugging and httpprof options and will let you know when it breaks again.

Thanks.


(Alexandre Klein) #4

It looks like it is the pipelining option.
I removed it; I will see in the next days if it breaks again.
Thank you.


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.