Filebeat 5.0 beta 1 is not closing deleted files

Hello,

I am having an issue with beta1. It is holding open a lot of deleted files even though the files have not been updated for a time much greater than close_older. I am not setting close_after so it should be defaulting to 5 minutes.

<stapp@g2t4454c:filebeat5> lsof | grep 25938 | grep log/local | grep '(deleted)' | head filebeat 25938 stapp 41r REG 253,8 52432329 1443043 /usr/local/st/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010103824.log (deleted) filebeat 25938 stapp 43r REG 253,8 52428947 1444258 /usr/local/st/mount1/sa-stats/log/local/fast/fast-20161009080815.log (deleted) filebeat 25938 stapp 57r REG 253,8 20971721 1443090 /usr/local/st/mount1/sa-stats/log/local/axedalistener/axedalistener-20161008231614.log (deleted) filebeat 25938 stapp 79r REG 253,8 20971560 1443032 /usr/local/st/mount1/sa-stats/log/local/axedalistener/axedalistener-20161008222049.log (deleted) filebeat 25938 stapp 111r REG 253,8 52429002 1444294 /usr/local/st/mount1/sa-stats/log/local/fast/fast-20161009091241.log (deleted) filebeat 25938 stapp 116r REG 253,8 52428921 1443038 /usr/local/st/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010103334.log (deleted) filebeat 25938 stapp 123r REG 253,8 20971573 1443018 /usr/local/st/mount1/sa-stats/log/local/axedalistener/axedalistener-20161008220239.log (deleted) filebeat 25938 stapp 124r REG 253,8 52428911 1444304 /usr/local/st/mount1/sa-stats/log/local/fast/fast-20161009093057.log (deleted) filebeat 25938 stapp 130r REG 253,8 52429843 1443118 /usr/local/st/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010111444.log (deleted) filebeat 25938 stapp 166r REG 253,8 52435873 1443124 /usr/local/st/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010111955.log (deleted)

<stapp@g2t4454c:filebeat5> date Tue Oct 11 21:05:16 UTC 2016

The pattern of the filenames include a timestamp of when the file was rotated. The file was not updated after this timestamp.

Its not that close_older does not work at all it does seem to be closing some files. I am wondering if the problem is only related to deleted files.

Sampling the log for some of the files...

<stapp@g2t4454c:filebeat5> grep fast-20161009093057.log logs/* logs/filebeat.log:2016-10-11T13:38:35Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log logs/filebeat.log.2:2016-10-11T13:36:59Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log logs/filebeat.log.4:2016-10-09T15:16:57Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log logs/filebeat.log.6:2016-10-09T15:08:06Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log
<stapp@g2t4454c:filebeat5> grep acceptfile-20161010111444.log logs/* logs/filebeat.log:2016-10-11T13:38:35Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010111444.log logs/filebeat.log.2:2016-10-11T13:37:08Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010111444.log
<stapp@g2t4454c:filebeat5> grep fast-20161009093057.log logs/* logs/filebeat.log:2016-10-11T13:38:35Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log logs/filebeat.log.2:2016-10-11T13:36:59Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log logs/filebeat.log.4:2016-10-09T15:16:57Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log logs/filebeat.log.6:2016-10-09T15:08:06Z INFO Harvester started for file: /opt/mount1/sa-stats/log/local/fast/fast-20161009093057.log
Note the differing paths is the result of the use of a symbolic link in the prospector config.

Here we can see that some files are being closed.

2016-10-11T15:16:18Z INFO File is inactive: /opt/mount1/sa-stats/log/local/acceptfile/acceptfile-20161010154747.log. Closing because close_inactive of 5m0s reached. 2016-10-11T15:49:23Z INFO File is inactive: /opt/mount1/sa-stats/log/local/axedalistener/axedalistener-20161010082957.log. Closing because close_inactive of 5m0s reached. 2016-10-11T15:54:46Z INFO File is inactive: /opt/mount1/sa-stats/log/local/axedalistener/axedalistener-20161010082957.log. Closing because close_inactive of 5m0s reached.

Thanks in advance for any help you can provide.

Can you share your config file? Can you add some details on the exact way you do file rotation? (rename file to add tiemstamp, create new file)? Also can you elaborate on the way you use symlinks?

filebeat.yml

filebeat:
    prospectors:
    -
      paths:
        - /opt/mount1/sa-stats/log/local/fast/*.log
        - /opt/mount1/sa-stats/log/local/filepuller/*.log
        - /opt/mount1/sa-stats/log/local/acceptfile/*.log
        - /opt/mount1/sa-stats/log/local/axedalistener/*.log
        - /opt/mount1/sa-stats/log/local/poller/*.log
        - /opt/mount1/sa-stats/log/local/groupnot/*.log
        - /opt/mount1/sa-stats/log/local/pluginhandler/*.log
      encoding: plain
      input_type: log
      close_removed: true
      clean_removed: true
      ignore_older: 168h
      document_type: log4j-itg
      multiline:
        pattern: "^[0-9]{4}-[0-9]{2}-[0-9]{2}T([0-9]{2}:){2}[0-9]{2},[0-9]"
        negate: true
        match: after
    -
      paths:
        - /opt/mount1/sa-stats/log/local/legacy/legacy-*.log
      encoding: plain
      input_type: log
      document_type: legacy-itg
      multiline:
        pattern: ^(.*)-*[[0-9]{4}-[0-9]{2}-[0-9]{2}T([0-9]{2}:){2}[0-9]{2},[0-9]
        negate: true
        match: before
  registry_file: /usr/local/st/filebeat5/registry
output:
  logstash:
    hosts: ["logstash.example.com9700"]
    worker: 15
    loadbalance: true
logging:
  to_syslog: false
  to_files: true
  files:
    path: /usr/local/st/filebeat5/logs
    name: filebeat.log
  level: info

The applications are written in Java and use log4j to do the file rotation. I believe that they use

  1. close
  2. rename
  3. create new

Here is an example configuration

            <RollingFile name="RollingFileLocal" fileName="/opt/mount1/sa-stats/log/local/fast/fast.log"
                                    filePattern="/opt/mount1/sa-stats/log/local/fast/fast-%d{yyyyMMddHHmmss}.log">
                    <PatternLayout>
                            <Pattern>%d{ISO8601} %5p [%t] [%X{logger.client}] [%X{logger.module}] [%X{logger.vendor} %X{logger.product} %X{logger.productId} %X{logger.uuid} %X{logger.fid} %X{logger.eid} %X{logger.comp} %X{logger.LocalTxnID}] - %X{logger.prefix} %m%n</Pattern>
                    </PatternLayout>
                    <Policies>
                            <SizeBasedTriggeringPolicy size="50 MB" />
                    </Policies>

            <DefaultRolloverStrategy max="200">


            <Delete basePath="/opt/mount1/sa-stats/log/local/fast/" maxDepth="1">
              <IfFileName glob="fast-*.log">
                  <IfAny>
                    <IfAccumulatedFileSize exceeds="10 GB" />
                    <IfAccumulatedFileCount exceeds="200" />
                    <IfLastModified age="10d" />
                  </IfAny>
              </IfFileName>
            </Delete>
          </DefaultRolloverStrategy>

            </RollingFile>

Regarding symlinks, The filebeat config uses a path which contains a symlink.

<stapp@g2t4454c:filebeat5> ll /opt/mount1
lrwxrwxrwx 1 root root 20 Aug 22 19:41 /opt/mount1 -> /usr/local/st/mount1

There is nothing special I found in the config. close_inactive is not defined which means the default of 5 minutes is used. Logrotation looks like this is the "normal" one.

Any chance that you could share the full log file of filebeat with me?
An other thing I spotted is mount1 in the filebeat config which for me indicates that you are reading logs from a shared drive?

About symlinks: I see, not the files itself are a symlinks but the directory with the files is a symlink. This should have no affect as for filebeat it looks like it reads the original file.

Hi.

mount1 is a not a shared drive. Here filebeat is running on a physical machine and mount1 is mount point for an LVM volume consisting of a single physical disk which is directly attached.

Is there a non-public way I can send you the logs?

@chris.mcdermott You can share it with me via email. Is the discussname@elastic.co

Sorry for being dense, but I can't figure out what the exact email address is.

I tried discussname@elastic.co and filebeat@elastic.co. Both bounced.

Third guess, beats@elastic.co, was the charm :slight_smile:

Or maybe not...

Hello chris.mcdermott@example.com,

We're writing to let you know that the group you tried to contact (beats) may not exist, or you may not have permission to post messages to the group. A few more details on why you weren't able to post:

  • You might have spelled or formatted the group name incorrectly.
  • The owner of the group may have removed this group.
  • You may need to join the group before receiving permission to post.
  • This group may not be open to posting.

If you have questions related to this or any other Google Group, visit the Help Center at https://support.google.com/a/elastic.co/bin/topic.py?topic=25838.

Thanks,

elastic.co admins

I think he meant ruflin @ elastic.co :slight_smile:

@chris.mcdermott Sorry for the puzzle :wink: It is the one that @tudor posted.

@chris.mcdermott Thanks for the log files. Looking at your log files shows me lots of sending errors. As long as events cannot be sent to the output, the harvester will keep the file open. A workaround here is using close_timeout but this will lead to data loss in your case.

More important seems to me to investigate why quite frequently the sending of events to logstash fails. Interesting is that not all events fail, but only a subset.

2016-10-11T13:40:05Z INFO Non-zero metrics in the last 30s: libbeat.logstash.published_but_not_acked_events=4096 libbeat.logstash.call_count.PublishEvents=4 libbeat.publisher.published_events=4037 registrar.writes=2 libbeat.logstash.published_and_acked_events=3980 libbeat.logstash.publish.write_bytes=826186 registrar.states.update=4096 publish.events=4096 libbeat.logstash.publish.read_bytes=210 libbeat.logstash.publish.read_errors=2

That means the connection works but for example Logstash cannot keep up with all the events or network is flaky. How many filebeat instances are their sending the events to Logstash?

There are a lot of filebeat instances sending to logstash. Notice the problem is that logstash is dropping the connection (see the EOF errors). Its the latest version of logstash and we had upped the client timeout to a pretty hight number (300s I believe), but for some reason it prematurely terminates the connections.

As I have the suspicion that the not closing of deleted files is related to the connection issue, I suggest we focus first on the connection issues. Do you see any errors / messages on the Logstash side? Could it be that Logstash cannot keep up with the load of all the filebeat instances sending data? How many events per second do you have on the LS side?

This topic was automatically closed after 21 days. New replies are no longer allowed.