Sometimes Filebeat keeps old files open forever. lsof shows that Filebeat is at EOF, but for unknown reasons it does not close the file.
I am using the latest version of Filebeat.
Here is our setup:
output.logstash:
  # The Logstash hosts
  hosts: ["xx.yyy.zzz.141:5702", "xx.yyy.zzz.142:5702"]
  loadbalance: true
Concerning the rotation procedure:
There is no specific rotation procedure.
A new log file is created under a specific directory each time a new "operator" logs into the system, and it traces the respective actions (similar to an audit log).
Note: a log file can stay idle for a long period (e.g. 20 minutes) before it is written to again.
Housekeeping is implemented at OS level by a script that gzips a file once it has been idle (based on its modification time) for e.g. 30 minutes, and after a further period moves it out of the directory.
Our issue appears when some of these files (not all of them) are gzipped because their idle time (modification time) exceeds 30 minutes.
Based on the close_inactive setting (1m or 5m), Filebeat should already have closed the file, but no corresponding message "Closing because close_inactive of 1m0s reached" appears in the Filebeat log file.
Because of this, when the file is gzipped, a .deleted entry shows up in lsof (or .nfs in the case of an NFS directory), which keeps reserving disk space.
The only way to get rid of these entries and release the space is to restart the Filebeat process.
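For context, close_inactive is set on the prospector/input side. A minimal sketch of the relevant part (paths and values are placeholders, not our exact config; depending on the Filebeat version the section is filebeat.prospectors or filebeat.inputs):
filebeat.prospectors:
- input_type: log
  paths:
    - /New_sbllog/logs/*.log
  # close the harvester once the file has not changed for this long
  close_inactive: 1m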
Some points of interest based on our investigation:
The close_timeout parameter, when used for testing, works as expected.
There are consecutive log entries "Harvester started for file: ..." for the same filename, without any "Closing because close_inactive of 1m0s reached" between them.
(The start time of the new harvester matches a new entry in the respective log file after a long idle time.)
Is this expected? Have you noticed similar behavior?
We intend to use the clean_inactive/ignore_older parameters in order to clean the registry. Do you agree with that?
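This is roughly what we have in mind (values are examples only; as far as I understand, clean_inactive must be greater than ignore_older plus scan_frequency):
filebeat.prospectors:
- input_type: log
  paths:
    - /New_sbllog/logs/*.log
  # stop harvesting files that have not been modified for this long
  ignore_older: 2h
  # drop the registry state of files not modified for this long
  # (must be greater than ignore_older + scan_frequency)
  clean_inactive: 3h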
1. Concerning the unexpected behavior of the close_inactive flag: could it be related to LS functionality?
Should we check on the LS side?
Is it possible that LS does not send the expected ACK, and as a result the close_inactive timer is never started?
The following message appears often: ERR Failed to publish events: write tcp xxx:36859->yyy:5702: write: connection reset by peer
Could it be related to our issue?
Do you suggest increasing the value of client_inactivity_timeout?
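For reference, as far as I understand client_inactivity_timeout is set on the Logstash beats input; a sketch of where we would change it (port and value are placeholders):
input {
  beats {
    port => 5702
    # close idle Beats connections only after this many seconds (default 60)
    client_inactivity_timeout => 300
  }
}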
When using close_timeout, we see two different behaviors in the log file:
a.
2017-10-05T07:35:41+03:00 INFO Closing harvester because close_timeout was reached.
2017-10-05T07:35:41+03:00 INFO Reader was closed: /New_sbllog/logs/xxxx_0013_13631523.log. Closing.
b. Only the first message appears, without the Reader message, for the specific log file.
The error above is what I was (partly) looking for when I asked for the logs. It means there is something off with the connection to LS. As long as FB has not received confirmation that all events were sent, it keeps the file open. The only option to prevent this is close_timeout, which force-closes the file.
Do you see anything in the LS logs on why this connection is reset? Do you have a load balancer or something similar in place?
As I don't see pipelining in the LS output config above, I assume you don't have it enabled?
output.logstash:
  # The Logstash hosts
  hosts: ["xx.yyy.zzz.141:5702", "xx.yyy.zzz.142:5702"]
  loadbalance: true
Pipelining is not enabled.
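For completeness, if we wanted to enable it, it would be an additional setting under output.logstash; the exact default and behavior depend on the Filebeat version, so treat this as a sketch:
output.logstash:
  hosts: ["xx.yyy.zzz.141:5702", "xx.yyy.zzz.142:5702"]
  loadbalance: true
  # number of batches sent asynchronously while waiting for ACK; 0 disables pipelining
  pipelining: 2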
No related errors exist in the Logstash log files at the same timestamps as the Filebeat log messages.
Similar errors exist, but in different time frames.
Can you suggest an appropriate configuration for Logstash?
2 Logstash instances are running on 2 servers.
Should we increase client_inactivity_timeout as mentioned in the respective issue?
One related question:
Is there a safe "dummy" configuration we could use to confirm that we do not have any issue with the close_* parameters / OS etc., and that the issue is focused on the LS config/output? (e.g. using the file output)
Increasing client_inactivity_timeout should remove the errors, but in general it should not directly impact the problem.
As far as I understand you, close_timeout solves the problem? As long as your LS instances are not overloaded, the other close_* options should also release the file handlers.
Your best test environment to rule out network and LS problems is definitely writing to files and seeing what happens. If the problem does not occur when you write to a file, it's a strong indicator that it's related to LS, the network, or the LS output on the Beats side.
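A minimal sketch of what that test could look like, assuming you temporarily swap out the Logstash output (path and filename are placeholders):
output.file:
  # write events to local files instead of Logstash, taking LS and the network out of the picture
  path: "/tmp/filebeat"
  filename: filebeat.out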
How many events per second do you have? Are your LS instances under load?
Do you have any load balancer between Beats and Logstash?