Filebeat never releases unmodified log files

Sometimes, Filebeat keeps old files open forever. lsof shows us that Filebeat is at EOF, but for some unknown reason it doesn't close the file.
I am using the latest version of Filebeat.
Here is our setup:

```yaml
close_inactive: 1m
close_timeout: 2h
close_renamed: true
close_removed: true
force_close_files: true
```

Could you share the following additional details?

  • Full config file
  • Your setup (LS, ES, ...)
  • Exact version (I assume 5.6.2?)

It would also be interesting to hear how you rotate the logs.

Hello,
Config file:

```yaml
###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.full.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

#=========================== Filebeat prospectors =============================

filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- input_type: log
  paths:
    - /New_sbllog/logs/eCommunicationsObjMgr_enu*.log
  #exclude_files: ['.nfs.*']
  tags: ["ecommenu#tab","siebel","eComm"]
  multiline.pattern: '^[\w ]+\t[\w ]+\t\d{1}\t\w+:\d{1}\t\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\t.*'
  multiline.negate: true
  multiline.match: after
  close_inactive: 1m
  close_timeout: 2h
  close_renamed: true
  close_removed: true
  force_close_files: true
  #ignore_older: 30m
```

Version 5.6.2; we also tried 6.
Oracle Linux 6.5.

Could you share the full config file including the output part and also:

  • Your setup (LS, ES, ...)
  • Details on how you rotate the logs.

Please make sure the formatting of the config files looks good in the post. You can use three backticks before and after the config files.

1. Concerning setup:

```yaml
output.logstash:
  # The Logstash hosts
  hosts: ["xx.yyy:zzz.141:5702","xx.yyy:zzz.142:5702"]
  loadbalance: true
```

2. Concerning the rotation procedure:
   There is no specific rotation procedure.
   A new log file is created under a specific directory each time a new "operator" logs in to the system and traces the respective actions (similar to an audit).
   Note: the respective log file can be idle for a long period (e.g. 20 minutes) before being written to again.

A housekeeping procedure is implemented at the OS level using a script which gzips a file once it has been idle (based on its file modification time) for e.g. 30 minutes, and after a period of time moves it out of the respective directory.

Our issue appears when such files (not all of them) are gzipped because their idle time (file modification time) exceeds 30 minutes.
Based on the close_inactive setting (1m or 5m), it is expected that Filebeat has already closed the file, but no respective message
"Closing because close_inactive of 1m0s reached"
exists in the Filebeat log file.
Because of the above, when the file is gzipped, a .deleted entry appears in lsof (or .nfs in the case of an NFS directory) which keeps reserving disk space.
The only way to get rid of these entries and release the space is to restart the Filebeat process.

Some points of interest based on our investigation:

  • The close_timeout parameter, when used for testing, works as expected.
  • There are consecutive log entries with
    "Harvester started for file: ...."
    for the same filename, without any
    "Closing because close_inactive of 1m0s reached"
    between them.
    (The start time of the new harvest matches a new entry in the respective log file after a long idle time.)
    Is this expected? Have you noticed similar behavior?
  • We intend to use the clean_inactive/ignore_older parameters in order to clean the registry (see the sketch after this list). Do you agree with that?
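For reference, a minimal sketch of how clean_inactive/ignore_older could be combined at the prospector level, assuming Filebeat 5.x. The 72h/73h durations are hypothetical placeholders, not values from our setup; the documented constraint is that clean_inactive must be greater than ignore_older plus scan_frequency:

```yaml
- input_type: log
  paths:
    - /New_sbllog/logs/eCommunicationsObjMgr_enu*.log
  close_inactive: 5m
  # Hypothetical value: stop harvesting files not modified for 72h ...
  ignore_older: 72h
  # ... and remove their state from the registry after 73h.
  # clean_inactive must be greater than ignore_older + scan_frequency.
  clean_inactive: 73h
```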

1. Concerning the unexpected behavior of the close_inactive flag: could it be related to LS functionality?
Should we check on the LS side?
Is it possible that LS does not send the expected ack, and that as a result the close_inactive timer is not started?

  2. The following message appears often:
    ERR Failed to publish events: write tcp xxx:36859->yyy:5702: write: connection reset by peer
    Could it be related to our issue?
    Do you suggest increasing the value of client_inactivity_timeout?

  3. When using close_timeout, we see 2 different behaviors for a log file:
    a. Both messages appear:

2017-10-05T07:35:41+03:00 INFO Closing harvester because close_timeout was reached.
2017-10-05T07:35:41+03:00 INFO Reader was closed: /New_sbllog/logs/xxxx_0013_13631523.log. Closing.

    b. Only the first message appears, without the Reader message for the specific log file.

The error above was kind of what I was looking for (or one of them) when I asked for the logs. It means there is something off with the connection to LS. As long as FB did not get the confirmation that all events were sent, it will keep the file open. The only option to prevent this is close_timeout, which will force close the file.

Do you see anything in the LS logs on why this connection is reset? Do you have a Load balancer or something similar in place?

As I don't see pipelining in the LS output config above, I assume you don't have it enabled?

Thanks for your immediate response.

Configuration:

```yaml
output.logstash:
  # The Logstash hosts
  hosts: ["xx.yyy:zzz.141:5702","xx.yyy:zzz.142:5702"]
  loadbalance: true
```

Pipelining is not enabled.

No respective errors exist in the Logstash log files at the same timestamps as the Filebeat log messages.
Similar errors exist, but in different time frames.


Can you suggest an appropriate configuration for Logstash?
2 Logstash instances are running on 2 servers.
Should we increase client_inactivity_timeout as mentioned in the respective issue?

One related question:
Is there a safe "dummy" configuration we could use to confirm that there is no issue with the close_* parameters / OS etc., and that the issue is focused on the LS config/output? (e.g. using the file output)

Increasing client_inactivity_timeout should remove the errors, but in general it should not directly impact the problem.

As far as I understand you, close_timeout solves the problem? As long as your LS instances are not overloaded, the other close_* options should also release the file handles.

Your best test environment for ruling out network and LS problems is definitely writing to files and seeing what happens. If the problem does not occur when you write to a file, it's a strong indicator that the issue is related to LS, the network, or the LS output on the Beats side.
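For such a test, a minimal sketch of a file output, assuming Filebeat 5.x; the path and filename below are hypothetical placeholders, and output.logstash has to be commented out since only one output can be enabled at a time:

```yaml
output.file:
  # Hypothetical target directory and filename for the test run.
  path: "/tmp/filebeat-file-output-test"
  filename: "filebeat-events"
```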

How many events per second do you have? Are your LS instances under load?

Do you have any load balancer in between beats and logstash?
