There might still be backpressure in the system. The logs do print some internal metrics every 30s. Can you share some of these?
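If it helps, these lines can be pulled out of the filebeat log with something like this (assuming the default log location /var/log/filebeat/filebeat; adjust for your setup):

grep "Non-zero metrics" /var/log/filebeat/filebeat | tail -n 10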
If you run filebeat with -d 'logstash', you can enable debug logging for the logstash output. This log will contain information about the number of events being transmitted plus ACKs. Comparing these timestamps gives you an idea of how many events/lines LS can process from this filebeat instance.
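For example, something along these lines (the config path is a placeholder for your install):

filebeat -e -c /etc/filebeat/filebeat.yml -d "logstash"

The -e flag writes the log to stderr, so you can watch the publish/ACK messages directly while the logstash debug selector is enabled.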
By configuring close_timeout: 240s, you can force filebeat to close and re-open the files every so often.
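As a rough sketch of that setting (the log path and Logstash host are placeholders, and this writes a standalone example config rather than touching your real filebeat.yml):

cat > /tmp/filebeat-close-timeout.yml <<'EOF'
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/myapp/temp.log      # placeholder path
    close_timeout: 240s              # force the harvester to close the file after 4 minutes; it is re-opened on the next prospector scan
output.logstash:
  hosts: ["localhost:5044"]          # placeholder Logstash endpoint
EOF
filebeat -e -c /tmp/filebeat-close-timeout.yml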
True, Load Average there is ~20 all the time, but two other servers have ~15 LA and have no such issues.
Yes, we use close_timeout as a trick on that server.
On the second server we have many more logs, and during business hours the lag gets up to 2 hours, but we never have locked deleted files there.
2017-11-10T13:40:12Z INFO Non-zero metrics in the last 1m0s: filebeat.harvester.open_files=1 filebeat.harvester.running=1 filebeat.harvester.started=1 libbeat.logstash.call_count.PublishEvents=21 libbeat.logstash.publish.read_bytes=126 libbeat.logstash.publish.write_bytes=6709789 libbeat.logstash.published_and_acked_events=42844 libbeat.publisher.published_events=42839 publish.events=43008 registrar.states.update=43008 registrar.writes=21
2017-11-10T13:41:12Z INFO Non-zero metrics in the last 1m0s: libbeat.logstash.call_count.PublishEvents=17 libbeat.logstash.publish.read_bytes=102 libbeat.logstash.publish.write_bytes=5347797 libbeat.logstash.published_and_acked_events=34679 libbeat.publisher.published_events=34680 publish.events=34816 registrar.states.update=34350 registrar.writes=16
2017-11-10T13:42:12Z INFO Non-zero metrics in the last 1m0s: libbeat.logstash.call_count.PublishEvents=15 libbeat.logstash.publish.read_bytes=90 libbeat.logstash.publish.write_bytes=4783812 libbeat.logstash.published_and_acked_events=30517 libbeat.publisher.published_events=30512 publish.events=30720 registrar.states.update=31186 registrar.writes=16
The indentation issue would be because of me copying and pasting into the gist. As conveyed, that should not be related to the file descriptor hanging.
We did an experiment and below are the results.
I was continuously writing log lines to a file named temp.log, which is configured in FB to be transferred to LS.
Observation:
--- The file descriptor was opened for temp.log
--- I stopped writing log lines to temp.log
--- The file descriptor was removed after 5s (as this is the default inactivity timeout value)
I then started continuously writing log lines to temp.log again, with FB still configured to transfer this log to LS.
--- The file descriptor was opened for temp.log
--- I moved temp.log to temp.1.log and touched a new empty file named temp.log
--- The file descriptor for temp.log changed to temp.1.log
--- Started writing to temp.log
--- A file descriptor was created for temp.log
--- Removed the temp.1.log file, and I can see the file descriptor is hanging:
ls -l /proc//fd/ | grep deleted
lr-x------. 1 logging logging 64 Nov 20 14:59 /proc/14773/fd/11 -> /root/temp.1.log (deleted)
Our assumption is that if FB has not yet read up to EOF and the file is removed in the meantime, the file descriptor keeps hanging until we restart FB.
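For reference, a minimal shell sketch of the same rotation sequence described above (paths are the ones from the experiment; it assumes a single running filebeat process):

# write to the watched file for a bit, then rotate it as in the experiment
for i in $(seq 1 1000); do echo "line $i" >> /root/temp.log; done
mv /root/temp.log /root/temp.1.log
touch /root/temp.log
echo "line after rotation" >> /root/temp.log
rm /root/temp.1.log
# check whether filebeat still holds the deleted file open
ls -l /proc/$(pgrep -f filebeat)/fd/ | grep -i deleted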