I am having an issue with beta1. It is holding open a lot of deleted files, even though the files have not been updated for a time much greater than close_inactive. I am not setting close_inactive explicitly, so it should default to 5 minutes.
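For reference, this is roughly where close_inactive would sit in the prospector config; the path below is a placeholder, not our actual one:

filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/app/*.log        # placeholder path
    # close_inactive is not set in our config, so the 5m default should apply
    #close_inactive: 5m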
<stapp@g2t4454c:filebeat5> date
Tue Oct 11 21:05:16 UTC 2016
The filename pattern includes a timestamp of when the file was rotated. The file was not updated after this timestamp.
It's not that close_inactive does not work at all; it does seem to be closing some files. I am wondering if the problem is only related to deleted files.
Sampling the log for some of the files...
<stapp@g2t4454c:filebeat5> grep fast-20161009093057.log logs/*
logs/filebeat.log:2016-10-11T13:38:35Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log
logs/filebeat.log.2:2016-10-11T13:36:59Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log
logs/filebeat.log.4:2016-10-09T15:16:57Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log
logs/filebeat.log.6:2016-10-09T15:08:06Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log
<stapp@g2t4454c:filebeat5> grep acceptfile-20161010111444.log logs/*
logs/filebeat.log:2016-10-11T13:38:35Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010111444.log
logs/filebeat.log.2:2016-10-11T13:37:08Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010111444.log
Note that the differing paths are the result of a symbolic link used in the prospector config.
Here we can see that some files are being closed.
2016-10-11T15:16:18Z INFO File is inactive: /opt/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010154747.log. Closing because close_inactive of 5m0s reached.
2016-10-11T15:49:23Z INFO File is inactive: /opt/mount1/sa-stats/log/local/axedalistener/axedalistener-20161010082957.log. Closing because close_inactive of 5m0s reached.
2016-10-11T15:54:46Z INFO File is inactive: /opt/mount1/sa-stats/log/local/axedalistener/axedalistener-20161010082957.log. Closing because close_inactive of 5m0s reached.
Can you share your config file? Can you add some details on exactly how you do file rotation (rename the file to add a timestamp, then create a new file)? Also, can you elaborate on how you use symlinks?
I found nothing special in the config. close_inactive is not defined, which means the default of 5 minutes is used. Log rotation looks like the "normal" kind.
Any chance that you could share the full log file of filebeat with me?
Another thing I spotted is mount1 in the filebeat config, which suggests to me that you are reading logs from a shared drive?
About symlinks: I see, the files themselves are not symlinks, but the directory containing the files is a symlink. This should have no effect, as to filebeat it looks like it reads the original file.
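To illustrate, a setup along these lines (paths are made up for the example) should behave the same as pointing at the real directory, since the symlinked directory component is resolved when the files are opened:

filebeat.prospectors:
  - input_type: log
    paths:
      # hypothetical: /opt/app/logs is a symlink to /opt/mount1/app/log;
      # the harvester still reads the files at their original location
      - /opt/app/logs/*/*.log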
mount1 is not a shared drive. Here filebeat is running on a physical machine, and mount1 is the mount point for an LVM volume consisting of a single, directly attached physical disk.
Is there a non-public way I can send you the logs?
We're writing to let you know that the group you tried to contact (beats) may not exist, or you may not have permission to post messages to the group. A few more details on why you weren't able to post:
You might have spelled or formatted the group name incorrectly.
The owner of the group may have removed this group.
You may need to join the group before receiving permission to post.
@chris.mcdermott Thanks for the log files. Looking at them, I see lots of sending errors. As long as events cannot be sent to the output, the harvester keeps the file open. A workaround here is close_timeout, but in your case this will lead to data loss.
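A minimal sketch of that workaround, with a placeholder path and an example value you would need to tune:

filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/app/*.log        # placeholder path
    # close the harvester after this period even if it has not finished the file;
    # events not yet shipped from that file can be lost, hence the caveat above
    close_timeout: 30m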
More important, it seems to me, is to investigate why sending events to Logstash fails so frequently. Interestingly, not all events fail, only a subset.
2016-10-11T13:40:05Z INFO Non-zero metrics in the last 30s: libbeat.logstash.published_but_not_acked_events=4096 libbeat.logstash.call_count.PublishEvents=4 libbeat.publisher.published_events=4037 registrar.writes=2 libbeat.logstash.published_and_acked_events=3980 libbeat.logstash.publish.write_bytes=826186 registrar.states.update=4096 publish.events=4096 libbeat.logstash.publish.read_bytes=210 libbeat.logstash.publish.read_errors=2
That means the connection works, but, for example, Logstash cannot keep up with all the events or the network is flaky. How many filebeat instances are there sending events to Logstash?
There are a lot of filebeat instances sending to logstash. Notice the problem is that logstash is dropping the connection (see the EOF errors). It's the latest version of logstash, and we had upped the client timeout to a pretty high number (300s I believe), but for some reason it prematurely terminates the connections.
As I suspect that the failure to close deleted files is related to the connection issue, I suggest we focus on the connection issues first. Do you see any errors / messages on the Logstash side? Could it be that Logstash cannot keep up with the load of all the filebeat instances sending data? How many events per second do you have on the LS side?