Filebeat fails to harvest if a file and a symlink to that file is in the same directory

Hi,

It seems to me if symlink exists in the directory where filebeat should work on then filebeat won't harvest.
Our system is writing logs in a format where every hour we name the logfile with the actual hour in the filename and we have a symlink named with current which always points to the latest file like this:
log_2015-01-01_00001
log_2015-01-01_00002
current -> log_2015-01-01_0002

If I start filebeat and symlink is in the actual directory then filebeat fails to harvest any of the files.

I could reproduce this issue in the following way:

  1. Create a directory
  2. Put a logfile there (test.log for example)
  3. Create a symlink to that logfile at the same directory (test.current for example)
  4. Start filebeat and it won't harvest that file.

I tried it with filebeat-5.0.0-alpha2-x86_64 on Ubuntu 12.04.

I could not even find these files in the registry as well:
cat data/registry |grep test|wc -l
0

See my debug logs attached.

If I set an exclude on the symlink then everything works fine.

Here are the debug logs:
2016/05/11 08:56:37.311071 beat.go:276: INFO filebeat start running.
2016/05/11 08:56:37.311147 registrar.go:60: INFO Registry file set to: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/data/registry
2016/05/11 08:56:37.311214 registrar.go:70: INFO Loading registrar data from /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/data/registry
2016/05/11 08:56:37.311665 log.go:12: WARN client/brokers registered new broker #1003 at x.x.x.x
2016/05/11 08:56:37.311781 log.go:12: WARN client/brokers registered new broker #1002 at y.y.y.y
2016/05/11 08:56:37.311945 log.go:16: WARN kafka message: Successfully initialized new client
2016/05/11 08:56:37.312345 spooler.go:41: DBG Spooler will use the default spool_size of 2048
2016/05/11 08:56:37.312403 spooler.go:47: DBG Spooler will use the default idle_timeout of 5s
2016/05/11 08:56:37.312531 crawler.go:37: INFO Loading Prospectors: 1
2016/05/11 08:56:37.312576 crawler.go:42: DBG File Configs: [/opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/]
2016/05/11 08:56:37.312625 prospector.go:241: INFO Set ignore_older duration to 0
2016/05/11 08:56:37.312666 prospector.go:241: INFO Set scan_frequency duration to 10s
2016/05/11 08:56:37.312723 prospector.go:177: INFO buffer_size set to: 16384
2016/05/11 08:56:37.312791 prospector.go:190: INFO input_type set to: log
2016/05/11 08:56:37.312827 prospector.go:241: INFO Set backoff duration to 1s
2016/05/11 08:56:37.312856 prospector.go:201: INFO backoff_factor set to: 2
2016/05/11 08:56:37.312886 prospector.go:241: INFO Set max_backoff duration to 10s
2016/05/11 08:56:37.312932 prospector.go:211: INFO force_close_file is disabled
2016/05/11 08:56:37.312964 prospector.go:241: INFO Set close_older duration to 1h0m0s
2016/05/11 08:56:37.312993 prospector.go:222: INFO max_bytes set to: 10485760
2016/05/11 08:56:37.313075 prospector_log.go:40: DBG exclude_files: [.gz$]
2016/05/11 08:56:37.313116 prospector_log.go:79: DBG scan path /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/

2016/05/11 08:56:37.313281 prospector_log.go:92: DBG Check file for harvesting: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.current
2016/05/11 08:56:37.313295 spooler.go:75: INFO Starting spooler: spool_size: 2048; idle_timeout: 5s
2016/05/11 08:56:37.313522 prospector_log.go:150: DBG Start harvesting unknown file: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.current
2016/05/11 08:56:37.313674 registrar.go:174: INFO New file. Start reading from the beginning: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.current
2016/05/11 08:56:37.313729 prospector_log.go:263: DBG Start / resuming harvester of file: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.current
2016/05/11 08:56:37.313784 prospector_log.go:92: DBG Check file for harvesting: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.log
2016/05/11 08:56:37.313837 prospector_log.go:150: DBG Start harvesting unknown file: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.log
2016/05/11 08:56:37.313711 registrar.go:77: INFO Starting Registrar
2016/05/11 08:56:37.313951 registrar.go:174: INFO New file. Start reading from the beginning: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.log
2016/05/11 08:56:37.314127 prospector_log.go:251: DBG Launching harvester on renamed file. File rename was detected: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.current -> /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.log
2016/05/11 08:56:37.314217 log.go:227: DBG harvest: "/opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.current" (offset snapshot:0)
2016/05/11 08:56:37.314287 log.go:41: INFO Harvester started for file: /opt/filebeat/filebeat-5.0.0-alpha2-x86_64/test/test.current
2016/05/11 08:56:42.313444 spooler.go:132: DBG Flushing spooler because of timeout. Events flushed: 0

Thanks,
Tamas

@treff7es1 Thanks for the report. I will have a closer look at this.

1 Like

@treff7es1 I started to investigate this issue. The first thing I asked myself is what do we expect if there is a symlink to a file which is also in the same directory? Does it just add the orginal to our harvester list? Should we even "respect" symlinks or ignore them?

I couldn't reproduce the behaviour about not reading at all, but it currently reads the file twice (which is somehow expected as it treats the symlink like a file). Each symlink has its own inode ...

@ruflin I would expect to harvest only the file where the symlink points and not add the symlink itself. I think that would be a great idea if it could be set if symlink should be skipped.

I will try the usecase again on my side to double check the not harvesting case.

@treff7es1 I opened a Github issue here with this: https://github.com/elastic/beats/issues/1686 Lets move our conversation there.

1 Like