Filebeat 5.4.1 not updating registry when file closed/removed

Hello,

I'm running Filebeat & ELK 5.4.1 on CentOS 7. The server where Filebeat runs hosts 2 applications whose logs get rotated weekly. Filebeat is setup to close an inactive file after 7 minutes and to remove a file from it's registry if the file is deleted. Neither the close or the removal happens, based on Filebeat's debug log & the contents of the registry. This happens on about 50% of application servers, whose Filebeat- & server setup is just about identical.

As far as I can deduce Filebeat does not recognize that a file has been inactive or closed as the debug log does do not close a file after it's inactivity timeout has been reached and, when that happens, none of the subsequent actions take place (remove from registry etc.).

o- Is there any way to 'force' Filebeat to update it's registry?

o- How do Filebeat determine that a file is inactive? (I am 100% sure that a file is not being updated yet Filebeat, on SOME servers, never does a 'close inactive').

I have seen mention in the Elastic manuals that Filebeat sometimes 'waits' for a filesystem that might have become unavailable to become available and that could be why it does not update it's registry. This is not the the case for me as the file(s) are on disks that are permanently attached to the server. Also - I think catering for filesystems that are available intermittendly should be a feature that could be disabled via a config option as surely it is the exception rather than the rule (apologies if such an option is available).

The logrotate occurred at around 11:06 in the Filebeat log (below)) - please note that the logs is for 2 application log files. I do see this message 'File rename detected but harvester not finished yet' AFTER the rotate/rename has happened.

Below Filebeat's debug log, config file & logrotate config file.

Filebeat.yml

filebeat.prospectors:
- input_type: log
  paths:
    - /ApplicationOnePath/production.log*
  document_type: ElasticIndex01
  exclude_files: ['\.gz$']
  close_inactive: 7m
  clean_removed: true
  close_removed: true
  scan_frequency: 11s

- input_type: log
  paths:
    - /ApplicationTwoPath/production.log*
  document_type: ElasticIndex02
  exclude_files: ['\.gz$']
  close_inactive: 7m
  clean_removed: true
  close_removed: true
  scan_frequency: 11s

output.logstash:
  hosts: ['10.0.202.32:5044']
  loadbalance: false
  worker: 2

logging.level: info
logging.to_files: true
logging.to_syslog: false
logging.files:
  path: /var/log/filebeat/
  name: filebeat.log
  keepfiles: 4

Logrotate config

create
dateformat -%Y%m%d
dateext
rotate 1

/ApplicationOnePath/production.log {
    create 664 AppUser AppUser
    ifempty
    missingok
    nocompress
    nocopy
    nocopytruncate
    nomail
    su AppUser AppUser
}

/ApplicationTwoPath/production.log {
    create 664 AppUser AppUser
    ifempty
    missingok
    nocompress
    nocopy
    nocopytruncate
    nomail
    su AppUser AppUser
}

I have a filebeat debug log available but see no facility for me to attach a compressed version to this ticket. I have compared this log to that of servers where Filebeat functions as documented and can confirm that, for example, messages stating that a file is being closed due to being inactive do not occur in the non-working log. Please advise if you require further log info.

We can 'force' Filebeat's registry to be 'corrected' by deleting the registry while Filebeat is running but, as this is a undocumented 'hack', do not see that as an acceptable workaround on production servers. I have tried restarting Filebeat without deleting the registry without success.

Thanks for the detailed report. On a first glance I couldn't see anything special. On the machines that this issue happens, do you see any errors for sending events to LS? Could you put your log file into a gist and link it here?

What is the file system type you are using? By constantly attached you mean it is not mounted as a network volume, right?

Hi Ruflin & Thanks for your reply

Filebeat debug log is at https://gist.github.com/Sean1e/eaff55531e7722ea24858eec037ca609

Please note that it contains 2 logs

Basically - The application servers in question are pretty 'standard', i.e. created using the same scripts about 6 months ago

Nothing obvious happening on the servers where the issue occurs. One example of a working/non-working pair is a pair of duplicate application servers with no 'obvious' difference in application or system setup

My thoughts (guess) are something like a prior unsuccessful write to Logstash/Elastic is causing one server to say 'I might still have things to do with this log' while the other one says 'All done' :wink:

The filesystems are XFS and, in all cases, the logs in question are on the root drive of the server (i.e. not a mounted drive).

I'm very unclear about the 'waiting to see if a filesystem re-appears' functionality of Filebeat - It seems to be a 'uncommon' situation and, to my mind at least, could potentially cause issues. It sounds like something that a ELK user should be able to disable.

Also - I have noticed that if Filebeat do not recognize that a file has been inactive and can be closed or renamed it also do not recognize that a file has been deleted and can be removed from the registry. In all non-working cases a files' inode was not the issue (ie. was not reused etc). This is why I was hoping for an option to tell Filebeat 'Stop waiting for files that might re-appear & re-create your registry using purely the info in it's config file & file 'progress' as kept by a processor in ram & remove files from the registry if it's inode do not exist anymore.

Thanks again Ruflin
Séan

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.