Filebeat stops processing and does not retry

Hello,

Using Filebeat 5.5.1 on RHEL 6.9, pushing logs directly to Elasticsearch 5.5.0, I'm seeing this strange behavior while running Filebeat, which is configured to harvest a single test log file of ~100 GB:

[... runs successfully for about 15 minutes ...]
    2017-08-04T19:07:59+02:00 INFO Non-zero metrics in the last 30s: libbeat.es.call_count.PublishEvents=240 libbeat.es.publish.read_bytes=933088 libbeat.es.publish.write_bytes=17833330 libbeat.es.published_and_acked_events=258048 libbeat.publisher.published_events=245760 publish.events=245760 registrar.states.update=245760 registrar.writes=10
    2017-08-04T19:08:29+02:00 INFO Non-zero metrics in the last 30s: libbeat.es.call_count.PublishEvents=288 libbeat.es.publish.read_bytes=1036374 libbeat.es.publish.write_bytes=22013455 libbeat.es.published_and_acked_events=285696 libbeat.publisher.published_events=294912 publish.events=294912 registrar.states.update=294912 registrar.writes=12
    2017-08-04T19:08:58+02:00 INFO Stopping filebeat
    2017-08-04T19:08:58+02:00 INFO Prospector outlet closed
    2017-08-04T19:08:58+02:00 INFO Prospector outlet closed
    2017-08-04T19:08:58+02:00 INFO Prospector outlet closed
    2017-08-04T19:08:58+02:00 INFO Prospector channel stopped because beat is stopping.
    2017-08-04T19:08:59+02:00 INFO Non-zero metrics in the last 30s: libbeat.es.call_count.PublishEvents=264 libbeat.es.publish.read_bytes=1008692 libbeat.es.publish.write_bytes=20656165 libbeat.es.published_and_acked_events=277504 libbeat.publisher.published_events=270336 publish.events=270336 registrar.states.update=270336 registrar.writes=11
    2017-08-04T19:16:29+02:00 INFO No non-zero metrics in the last 30s
    2017-08-04T19:16:59+02:00 INFO No non-zero metrics in the last 30s
[... and then this log message repeats forever ...]

I haven't triggered any stop, yet you can see these worrisome messages: "Stopping filebeat" and "Prospector channel stopped because beat is stopping."

It would normally take several hours to publish the 100 GB log file, so I don't think Filebeat reached EOF after only 15 minutes of processing.

My configuration:

    filebeat.prospectors:
    - input_type: log
      paths:
      - /some/path/jboss.log
      exclude_files: ['\.gz$']
      ignore_older: 2h
      fields_under_root: true
      fields:
        type: jboss
        Platform: some-platform
      multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} [A-Z]+ +\('
      multiline.negate: true
      multiline.match: after
      harvester_buffer_size: 2097152

    # 3 nodes X 8 workers X 1024 bulk events
    filebeat.spool_size: 24576

    output.elasticsearch:
      enabled: true
      hosts: ["node1:9200", "node2:9200", "node3:9200"]
      template.enabled: false
      index: "someindex-%{+yyyy.MM.dd.HH}"
      pipeline: "some-pipeline"
      worker: 8
      bulk_max_size: 1024
      compression_level: 3

Thanks for helping out,
MG

I have added this parameter (even though, per the documentation, it is normally ignored because Filebeat is expected to retry indefinitely):

    output.elasticsearch:
    ....
      max_retries: 10

And it works as expected. So either I messed up somewhere with the logs and/or the age of the log file during my first (failed) test, or this max_retries setting really helped. Hard to say; it's not easy to reproduce all the conditions.
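
For completeness, here is a sketch of the merged output section with the workaround in place, assuming the other options stay exactly as in the config above (and see the note below on whether max_retries actually takes effect):

    output.elasticsearch:
      enabled: true
      hosts: ["node1:9200", "node2:9200", "node3:9200"]
      template.enabled: false
      index: "someindex-%{+yyyy.MM.dd.HH}"
      pipeline: "some-pipeline"
      worker: 8
      bulk_max_size: 1024
      compression_level: 3
      # workaround under test; per the docs Filebeat should retry forever anyway
      max_retries: 10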

Filebeat always sets max_retries: -1. This behaviour cannot be overridden from the config file.

For testing it's always helpful to 'reset' the state, that is, remove the registry file and update the file's timestamps via touch.
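
A minimal sketch of that reset, assuming an RPM/SysV install on RHEL with the default registry at /var/lib/filebeat/registry (adjust if your filebeat.registry_file points elsewhere) and the test log from the config above:

    # stop Filebeat so the registry is not rewritten on shutdown
    sudo service filebeat stop

    # forget all previously recorded read offsets
    sudo rm /var/lib/filebeat/registry

    # refresh the mtime so ignore_older: 2h does not skip the file
    touch /some/path/jboss.log

    sudo service filebeat start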