Filebeat: after a TCP write or read error, Filebeat stops harvesting files until restarted

Hello,
I am currently using the ELK stack in v5.5 and I have a quite big issue:
every time there is a TCP write or read error I have to restart Filebeat, because it stops sending messages.

2017-07-25T11:14:46+02:00 ERR Failed to publish events caused by: write tcp 163.172.15.176:53430->163.172.99.57:5000: write: connection reset by peer
2017-07-25T11:14:46+02:00 ERR Failed to publish events caused by: write tcp 163.172.15.176:53428->163.172.99.57:5000: write: connection reset by peer
2017-07-25T11:14:47+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=9 libbeat.logstash.publish.read_bytes=350 libbeat.logstash.publish.write_bytes=5351704 libbeat.logstash.publish.write_errors=4 libbeat.logstash.published_and_acked_events=30181 libbeat.logstash.published_but_not_acked_events=12220 libbeat.publisher.published_events=30901 publish.events=30181 registrar.states.update=30181 registrar.writes=5
2017-07-25T11:15:17+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=4408 libbeat.logstash.publish.write_bytes=1727247 libbeat.logstash.published_and_acked_events=24497 libbeat.publisher.published_events=13739 publish.events=12277 registrar.states.update=12277 registrar.writes=2
2017-07-25T11:15:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:16:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:16:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:17:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:17:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:18:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:18:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:19:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:19:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:20:17+02:00 INFO No non-zero metrics in the last 30s
When I restart Filebeat, it pushes the missing messages as well as the new ones.

Here is my Filebeat config:

filebeat:
  name: "host7"
  spool_size: 16384
  prospectors:
  -
    paths:
      - /var/log/varnish/varnish.log
    input_type: log
    fields_under_root: true
    fields:
      tags: ['json', 'varnish']
      platform: boxes
    document_type: varnish-logs
    close_inactive: 5m

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  pipelining: 5
  worker: 2
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]

I have to send the logs to a remote datacenter. My Logstash usually receives about 12k messages/s, and I have the same problem on 5 different platforms (especially the ones that don't send a lot of messages).

I started making extensive use of Filebeat (and started load balancing) when I migrated from 5.4 to 5.5, so I am not sure whether the problem appeared with the 5.5 migration or whether it would also occur in 5.4.

Thanks!

Can you run filebeat with debug logs enabled?

logging.level: debug
logging.selectors: ["output", "logstash"]

With these selectors, the Logstash output will print 'close connection' and 'connect' messages on reconnect, plus debug messages on the number of events sent. The 'output' selector might add messages like 'add non-published events back into pipeline' and 'async bulk publish success'.

Please note, upon failure the client uses exponential backoff (but only up to 1 minute).
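For example, assuming the backoff starts at 1 second and doubles on every failed attempt (the 1 second starting value is an assumption on my part; only the 1 minute cap is stated above), the retry delays would look roughly like 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s, ... so a failing connection is retried at most once per minute rather than abandoned.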

You can kill Filebeat with kill -ABRT <pid>; it will print a stack trace. Alternatively you can start Filebeat with -httpprof :6060 and get a stack trace of all goroutines via curl http://localhost:6060/debug/pprof/goroutine.
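For example, something along these lines (a sketch only; the 127.0.0.1:6060 bind address, the config path, and the ?debug=2 query parameter are my assumptions / standard Go pprof conventions):

# start filebeat with the profiling endpoint enabled
filebeat -e -c /etc/filebeat/filebeat.yml -httpprof 127.0.0.1:6060

# take a full goroutine dump (repeat a few times while the output appears hung)
curl -s "http://127.0.0.1:6060/debug/pprof/goroutine?debug=2" > goroutines-$(date +%s).txt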

Having multiple stack traces plus debug logs can be helpful for identifying if/where the outputs might actually hang.

The spool_size is only twice bulk_max_size. Why have 2 workers, with pipelining set to 5? Does the problem still occur if you set pipelining to 0?
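That is, keeping your output section as it is but with pipelining disabled, something like this (just a suggestion to test, not a verified config):

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  pipelining: 0
  worker: 2
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]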

My Logstash servers are not in the same datacenter, so while reading the documentation about pipelining I understood that setting pipelining (I put a random number for testing) would allow pushing further batches without waiting for an ACK.

For bulk_max_size and spool_size it seems I misunderstood the lock-step behaviour. Workers are not needed (events are sent to N hosts in lock-step).

I added the debugging and httpprof options and will let you know when it breaks again.

Thanks.

It looks like it's the pipelining option.
I removed it; I will see in the next few days if it breaks again.
Thank you.
