Filebeat: If tcp write or read error then filebeat stops harvesting files until restart


(Alexandre Klein) #1

Hello,
I am currently using the ELK stack in v5.5 and I have quite a big issue:
every time I get a TCP write or read error, I have to restart Filebeat because it stops sending messages.

2017-07-25T11:14:46+02:00 ERR Failed to publish events caused by: write tcp 163.172.15.176:53430->163.172.99.57:5000: write: connection reset by peer
2017-07-25T11:14:46+02:00 ERR Failed to publish events caused by: write tcp 163.172.15.176:53428->163.172.99.57:5000: write: connection reset by peer
2017-07-25T11:14:47+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=9 libbeat.logstash.publish.read_bytes=350 libbeat.logstash.publish.write_bytes=5351704 libbeat.logstash.publish.write_errors=4 libbeat.logstash.published_and_acked_events=30181 libbeat.logstash.published_but_not_acked_events=12220 libbeat.publisher.published_events=30901 publish.events=30181 registrar.states.update=30181 registrar.writes=5
2017-07-25T11:15:17+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=4408 libbeat.logstash.publish.write_bytes=1727247 libbeat.logstash.published_and_acked_events=24497 libbeat.publisher.published_events=13739 publish.events=12277 registrar.states.update=12277 registrar.writes=2
2017-07-25T11:15:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:16:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:16:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:17:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:17:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:18:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:18:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:19:17+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:19:47+02:00 INFO No non-zero metrics in the last 30s
2017-07-25T11:20:17+02:00 INFO No non-zero metrics in the last 30s
When I restart Filebeat, it pushes the missing messages and the new ones.

Here is my filebeat conf:

filebeat:
  name: "host7"
  spool_size: 16384
  prospectors:
  -
    paths:
      - /var/log/varnish/varnish.log
    input_type: log
    fields_under_root: true
    fields:
      tags: ['json', 'varnish']
      platform: boxes
    document_type: varnish-logs
    close_inactive: 5m

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  pipelining: 5
  worker: 2
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]

I have to send the logs to a distant datacenter. My Logstash usually gets 12k messages/s, and I have the same problem on 5 different platforms (especially the ones that don't send a lot of messages).

I started using Filebeat extensively (and started to load balance) when I migrated from 5.4 to 5.5, so I am not sure whether the problem appeared with the 5.5 migration or whether it would also occur in 5.4.

Thanks!


(Steffen Siering) #2

Can you run filebeat with debug logs enabled?

logging.level: debug
logging.selectors: ["output", "logstash"]

On reconnect, the logstash output will print close-connection and connect messages, plus debug messages on the number of events sent. The 'output' selector might add messages like: add non-published events back into pipeline and async bulk publish success.

Please note, upon failure the client uses exponential backoff (but only up to 1 minute).

When you kill filebeat with kill -ABRT <pid>, it will print a stack trace. Alternatively, you can start filebeat with -httpprof :6060 and get a stack trace of all goroutines via curl http://localhost:6060/debug/pprof/goroutine.

Having multiple stack traces plus debug logs can be helpful for identifying if/where the outputs might actually hang.

The spool_size is only twice bulk_max_size. Why have 2 workers with pipelining set to 5? Does the problem still occur if you set pipelining to 0?
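
For that test, the output section could look roughly like this (a sketch reusing the hosts and ssl settings from the config posted above, with only pipelining changed):

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  pipelining: 0       # no pipelining: wait for the ACK of each batch before sending the next one
  worker: 2
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]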


(Alexandre Klein) #3

My Logstash servers are not in the same datacenter, so ... while reading about the pipelining configuration option, I understood that setting pipelining (I put a random number for the test) would allow pushing further batches without waiting for an ACK.

For bulk_max_size and spool_size, it seems I misunderstood the lock-step behaviour. Workers are not needed (events go to N hosts in lock-step).
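
So the output section would probably end up like this (just a sketch, keeping everything else from my config above and dropping the worker and pipelining settings):

output.logstash:
  hosts: ["ls1:5000","ls2:5000"]
  loadbalance: true
  bulk_max_size: 8192
  ssl:
    certificate_authorities: ["/etc/filebeat/wildcard.ls.dev.logstash.crt"]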

I added the debugging and httpprof options and will let you know when it breaks again.

Thanks.


(Alexandre Klein) #4

It looks like it is the pipelining option.
I removed it; I will see in the next days if it breaks again.
Thank you.


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.