Filebeat skipping files

Hi Team

I am using the Filebeat agent to monitor logs that are rotated after reaching a certain size. The system was working fine earlier; however, as the load and the rotation frequency increased, Filebeat seems to miss certain files in between.

Please suggest which parameters I can set.

How fast do you rotate files?

How do you rotate files (how fast are they deleted)?

How big are those files?

What is your filebeat config (scan frequency)?

Which output are you using?

Normally this is a sign that files/events are written faster than the downstream system can process them.
Overall there is a bandwidth limit, usually imposed by the network or the endpoint Filebeat has to send data to. Filebeat can read files quite fast, but if the network/endpoint limits the amount of data it can process, then Filebeat simply can't send any faster.
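If the bottleneck turns out to be downstream, settings like these can sometimes help by batching more events per request; the values below are purely illustrative, not a recommendation for this setup:

```yaml
output.logstash:
  hosts: ["localhost:5044"]   # placeholder host/port
  worker: 2                   # parallel connections to Logstash
  bulk_max_size: 4096         # events per batch
  pipelining: 2               # async batches in flight per connection

queue.mem:
  events: 8192                # internal buffer; should hold at least one full batch per worker
```

These only help if Logstash itself can absorb the extra throughput; otherwise back-pressure simply moves the queue growth into Filebeat.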

Hi Steffens

Thanks for the update.

Please find below the inputs you required:
File rotation - 3-4 files a minute
Files are not deleted for a day, only renamed, e.g. abc100.log, abc101.log, while the main input file remains abc.log
File size is 10 MB
Scan frequency was changed to 1 second as per the suggestion in the documentation.
Output used is Logstash

The whole setup is on a single node, i.e. a single-node ELK stack.
Filebeat configuration

ignore_older: 24h
scan_frequency: 1s
max_backoff: 1s
close_inactive: 10m

So a single Filebeat instance is reading six such files from different locations. Should I segregate them, i.e. run two Filebeat processes catering to three files each?
Should I make any changes to the parameters mentioned?

Thanks
Sumit

Can you share logs and full filebeat config? Please use the </> button to format those.

The logs print some metrics every 30s. This lets us see how many files are open, how many events have been sent to Logstash, and so on.

I don't think you have a harvester limit set, which means Filebeat should have all files open.
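For reference, that limit is a per-input setting and defaults to 0, i.e. unlimited; a minimal sketch (the path is a placeholder):

```yaml
filebeat.inputs:
- type: log
  paths:
    - /var/log/app/*.log
  # 0 (the default) puts no cap on concurrently open harvesters,
  # so every matching file can be read in parallel
  harvester_limit: 0
```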

From experience a slow grok filter in Logstash can kill throughput.
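As an illustration (not taken from this setup), an unanchored grok pattern forces expensive regex backtracking on every non-matching line; anchoring the pattern at the start of the line is a common fix:

```conf
filter {
  grok {
    # The leading ^ lets non-matching lines fail fast instead of
    # backtracking through the whole string.
    match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}
```

The field names above are hypothetical; the point is the `^` anchor, not the specific pattern.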

Hi, please find below the Filebeat input file.

filebeat.inputs:
- type: log
  enabled: true

  paths:
    - /mnt3/WLS_OSB1/logs/WLS_OSB1-diagnostic.log
    - /mnt3/WLS_OSB3/logs/WLS_OSB3-diagnostic.log
    - /mnt3/WLS_OSB5/logs/WLS_OSB5-diagnostic.log
  ignore_older: 11m
  scan_frequency: 5s
  max_backoff: 5s
  close_inactive: 1m
  clean_inactive: 12m
  multiline.pattern: '^\[20'
  multiline.negate: true
  multiline.match: after
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 3
setup.kibana:
output.logstash:
  hosts: ["localhost:7778"]
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
logging.level: debug
logging.to_files: true
logging.files:
  path: /u02/elastic/log
  name: filebeat_mnt3
  keepfiles: 4
  permissions: 0644

Also, I created a file output in Logstash without grok to compare the number of log lines, but the counts were not the same. Hence we can rule out a grok issue.
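For this kind of comparison it may help to count lines across all rotated files, not only the active one; a small hypothetical helper (the glob pattern is an assumption):

```python
import glob

def count_log_lines(pattern):
    """Sum line counts over every rotated file matching the glob pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:   # binary mode: no decoding surprises
            total += sum(1 for _ in f)
    return total

# e.g. count_log_lines("/mnt3/WLS_OSB1/logs/WLS_OSB1-diagnostic*.log")
```

Note also that with multiline enabled, Filebeat merges continuation lines into single events, so a raw line count on the source can legitimately exceed the event count seen in the Logstash file output.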

The setup was running fine; however, due to the increase in the number of log lines per second, I can see the issue appearing.

Please suggest if there are any limitations.

2019-08-16T18:24:54.459+0530    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":420,"time":{"ms":421}},"total":{"ticks":4910,"time":{"ms":4917},"value":4910},"user":{"ticks":4490,"time":{"ms":4496}}},"handles":{"limit":{"hard":16384,"soft":4096},"open":14},"info":{"ephemeral_id":"1e7271ad-b773-4127-b462-62f27efbc0f9","uptime":{"ms":30019}},"memstats":{"gc_next":79978688,"memory_alloc":66787696,"memory_total":1164832776,"rss":139440128}},"filebeat":{"events":{"active":4119,"added":28497,"done":24378},"harvester":{"open_files":6,"running":6,"started":6}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"output":{"events":{"acked":24357,"active":4096,"batches":15,"total":28453},"read":{"bytes":78},"type":"logstash","write":{"bytes":7087944}},"pipeline":{"clients":1,"events":{"active":4117,"filtered":21,"published":28473,"retry":2048,"total":28495},"queue":{"acked":24357}}},"registrar":{"states":{"cleanup":6,"current":239,"update":24378},"writes":{"success":23,"total":23}},"system":{"cpu":{"cores":4},"load":{"1":10.34,"15":12.27,"5":11.97,"norm":{"1":2.585,"15":3.0675,"5":2.9925}}}}}}
2019-08-16T18:25:17.474+0530    INFO    log/harvester.go:255    Harvester started for file: /mnt3/WLS_OSB1/logs/WLS_OSB1-diagnostic.log
2019-08-16T18:25:17.477+0530    INFO    log/harvester.go:255    Harvester started for file: /mnt3/WLS_OSB3/logs/WLS_OSB3-diagnostic.log
2019-08-16T18:25:17.482+0530    INFO    log/harvester.go:255    Harvester started for file: /mnt3/WLS_OSB5/logs/WLS_OSB5-diagnostic.log
2019-08-16T18:25:24.426+0530    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":610,"time":{"ms":192}},"total":{"ticks":8680,"time":{"ms":3767},"value":8680},"user":{"ticks":8070,"time":{"ms":3575}}},"handles":{"limit":{"hard":16384,"soft":4096},"open":17},"info":{"ephemeral_id":"1e7271ad-b773-4127-b462-62f27efbc0f9","uptime":{"ms":60016}},"memstats":{"gc_next":90522528,"memory_alloc":72922072,"memory_total":2108580032,"rss":6291456}},"filebeat":{"events":{"added":23393,"done":23393},"harvester":{"open_files":9,"running":9,"started":3}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":23390,"batches":13,"total":23390},"read":{"bytes":84},"write":{"bytes":5976010}},"pipeline":{"clients":1,"events":{"active":4117,"filtered":3,"published":23390,"total":23393},"queue":{"acked":23390}}},"registrar":{"states":{"current":242,"update":23393},"writes":{"success":13,"total":13}},"system":{"load":{"1":10.59,"15":12.24,"5":11.89,"norm":{"1":2.6475,"15":3.06,"5":2.9725}}}}}}
2019-08-16T18:25:36.786+0530    INFO    log/harvester.go:280    File is inactive: /mnt3/WLS_OSB5/logs/WLS_OSB5-diagnostic.log. Closing because close_inactive of 1m0s reached.
2019-08-16T18:25:43.318+0530    INFO    log/harvester.go:280    File is inactive: /mnt3/WLS_OSB1/logs/WLS_OSB1-diagnostic.log. Closing because close_inactive of 1m0s reached.
2019-08-16T18:25:45.764+0530    INFO    log/harvester.go:280    File is inactive: /mnt3/WLS_OSB3/logs/WLS_OSB3-diagnostic.log. Closing because close_inactive of 1m0s reached.
2019-08-16T18:25:50.714+0530    INFO    log/harvester.go:255    Harvester started for file: /mnt3/WLS_OSB1/logs/WLS_OSB1-diagnostic.log
2019-08-16T18:25:50.715+0530    INFO    log/harvester.go:255    Harvester started for file: /mnt3/WLS_OSB3/logs/WLS_OSB3-diagnostic.log
2019-08-16T18:25:50.722+0530    INFO    log/harvester.go:255    Harvester started for file: /mnt3/WLS_OSB5/logs/WLS_OSB5-diagnostic.log
2019-08-16T18:25:54.426+0530    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":820,"time":{"ms":213}},"total":{"ticks":12020,"time":{"ms":3351},"value":12020},"user":{"ticks":11200,"time":{"ms":3138}}},"handles":{"limit":{"hard":16384,"soft":4096},"open":17},"info":{"ephemeral_id":"1e7271ad-b773-4127-b462-62f27efbc0f9","uptime":{"ms":90016}},"memstats":{"gc_next":76873056,"memory_alloc":60059272,"memory_total":2946329872,"rss":241664}},"filebeat":{"events":{"added":20695,"done":20695},"harvester":{"closed":3,"open_files":9,"running":9,"started":3}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":20686,"batches":11,"total":20686},"read":{"bytes":72},"write":{"bytes":4935630}},"pipeline":{"clients":1,"events":{"active":4117,"filtered":9,"published":20686,"total":20695},"queue":{"acked":20686}}},"registrar":{"states":{"cleanup":3,"current":242,"update":20695},"writes":{"success":11,"total":11}},"system":{"load":{"1":12.15,"15":12.33,"5":12.19,"norm":{"1":3.0375,"15":3.0825,"5":3.0475}}}}}}

The snippet from the logs doesn't include many details. The metrics are logged every 30s.

All in all, it looks like filebeat is mostly idle, taking between 16-40% CPU on average. The files have been inactive for a while (no more updates), and suddenly the files are opened again and CPU goes up...

The output.events.acked metric is interesting. It is the number of events that Logstash has confirmed as processed. The throughput seems to be about 760 events per second, which is very low. Low throughput combined with low CPU usage is a sign of either a) files receiving new lines at a low rate, or b) back-pressure from the network or from a slow filter in Logstash.
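The ~760 figure can be reproduced from the three metric windows above (each covers 30 seconds):

```python
# libbeat.output.events.acked from the three 30-second windows in the log excerpt
acked = [24357, 23390, 20686]

# average events per second across the 90 seconds covered
eps = sum(acked) / (30 * len(acked))
print(round(eps))  # prints 760
```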

The path configuration itself makes me somewhat suspicious. This might be a problem of its own, but how do you rotate files? Do you use truncate, or do you rename files?
Truncating files has a high chance of data loss.
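If a tool like logrotate were involved, rename-based rotation (the default, i.e. without copytruncate) is the safer mode for Filebeat; a hypothetical sketch, with path and values purely illustrative:

```conf
/mnt3/WLS_OSB1/logs/WLS_OSB1-diagnostic.log {
    size 10M
    rotate 144
    # no "copytruncate": the file is renamed and a fresh one created,
    # so Filebeat can finish reading the renamed file via its open handle
    create
}
```

With copytruncate, anything written between the copy and the truncate is lost, and Filebeat's saved offset can end up beyond the truncated file's size.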