File changes not detected after a while

I am testing the load balancer feature, and in my test filebeat seems to stall after a while.
I have an ELK cluster with 4 LS nodes. I have a loop constantly generating messages to /var/log/messages, and when I start filebeat it captures all new changes at that moment and submits them to LS, but after that it basically stalls until I restart it, even if I take down the LS node it connected to.

Am I missing something?

Here is my configuration:

filebeat:
  spool_size: 5
  prospectors:
    -
      paths:
        - "/var/log/*.log"
        - /var/log/messages
      type: log
      scan_frequency: 5s
  idle_timeout: 10s

output:
  logstash:
    enabled: true
    hosts:
      - logstash.example.com:5044
    loadbalance: true

@jhidalgo Config looks good. Which system are you running it on? Can you start filebeat with the -d "*" (debug output) option and post some log entries here as soon as filebeat stalls?

I see your config has only one logstash instance configured. In this case, the load-balancing mode will not be loaded/initialized. You need to have at least two instances configured. If filebeat stalls with only one host configured, the issue is not related to the load-balancing mode. Still, filebeat might stall if all known logstash instances become unavailable. Have you tried running with -d '*,publish'? It would be interesting to see some debug/error output.
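
For illustration, a minimal sketch of the output section with two logstash hosts configured (the host names are placeholders), which is the minimum needed for the load balancer to be initialized:

output:
  logstash:
    enabled: true
    hosts:
      - logstash1.example.com:5044
      - logstash2.example.com:5044
    loadbalance: true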

Plus, load balancing in filebeat is currently limited due to filebeat waiting for an OK from logstash in order to update some internal state on disk that deals with filebeat restarts. There are still some remnants originating from logstash-forwarder which might be addressed in the future. Related filebeat issue: https://github.com/elastic/filebeat/issues/106

  1. Load Balancing:
    I am using a DNS entry that responds with 4 A records. From an administrative point of view it is much better to add and remove IPs at the DNS level than to change configuration files for multiple services. I am assuming FileBeat will consider those multiple IPs, but please let me know if that's not the case.

  2. File changes not detected. I am monitoring /var/log/messages with FB and creating new entries with logger, but this is what the -d "*,publish" output shows:
    prospector.go:163: DBG Check file for harvesting: /var/log/messages
    prospector.go:280: DBG Update existing file for harvesting: /var/log/messages
    prospector.go:322: DBG Not harvesting, file didn't change: /var/log/messages
    output.go:37: DBG output worker: no events to publish
    output.go:37: DBG output worker: no events to publish

No matter how many lines I append to /var/log/messages, those changes are not detected by FB.

  1. OS:
    CentOS Linux release 7.1.1503 (Core)
    Derived from Red Hat Enterprise Linux 7.1 (Source)
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"

  2. Without considering load balancing, when FB starts it finds new entries and submits them to LS, but once all new entries have been found it seems to stall, no longer detecting or submitting new entries as I would expect given the configuration settings.

Thanks for your support

filebeat will get the list of IPs for every logstash host configured and chooses one IP at random, retrying if the connection fails.

One more question: are you using network shares, or running in a VM with sharing enabled?

AWS EC2 instance without network shares

@jhidalgo Have a look at https://github.com/elastic/filebeat/issues/167 Do you have some encoding configured?
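
For reference, a minimal sketch of what an explicit encoding setting would look like at the prospector level (the path and value here are examples only; by default no special encoding is set):

filebeat:
  prospectors:
    -
      paths:
        - /var/log/messages
      type: log
      encoding: plain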

I'm not sure if this is related - happy to start a new thread if not.

I run a pipeline latency test for my ELK cluster: writing timestamp info to a file every second, processing it at each Logstash node to calculate the delay at that node, and graphing the results when they arrive at ES.

I've noticed that filebeat is not always detecting the log file has updated, even with scan_frequency set at 1s. Instead it's deciding the file hasn't changed, and then periodically sends batches of messages to catch up.

As you can imagine this is terrible for a latency test - the result is irregular sawtooth graphs.

I've pasted a debug output of the issue at http://pastebin.com/uwUuE00k

Note that the log file is indeed being updated every second. The script writing it is just a simple "echo / sleep 1" type of thing, and I've tailed it during this test to double check.

I'm also sure the scan_frequency setting is doing what it says - the "Not harvesting, file didn't change" events occur every second. The problem appears to be that it's not detecting that the log file has changed at every interval.

Edit: I should point out that this happens immediately on startup, not after any particular waiting period.

Hi @ceekay, as part of this PR we just merged, a configurable parameter was introduced to control how often we check an open file for new writes. It's called backoff and the default is 1s.

This change has significantly sped up our automated tests, so I have the feeling it will also help with the issue you are seeing. We'll have the option in RC1, which we are currently working on releasing, but I can also trigger a nightly build if you want to test before that.

Edit: actual link to the PR and some wording.
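
For example, a prospector for a once-per-second log could be tuned like this (a sketch only; I'm assuming backoff sits at the prospector level next to scan_frequency, and the path is made up):

filebeat:
  prospectors:
    -
      paths:
        - /var/log/latency-test.log
      type: log
      scan_frequency: 1s
      # how long to wait before re-checking an open file for new lines
      backoff: 1s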

@ceekay filebeat has an internal spooler collecting lines from multiple events before publishing (think collecting batches).

See the config options spool_size and idle_timeout to tune the spooler. The smaller the batches, the more processing overhead, potentially.
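
For a low-volume latency test, a very small spooler could look like this (the values are illustrative):

filebeat:
  spool_size: 1      # publish each event as soon as it arrives instead of batching
  idle_timeout: 1s   # flush the spooler even when it is not full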

@tudor @steffens

Thanks for the updates. This sounds like it will resolve my issue - I'll check and let you know.

I understand the overhead increase for a very small spooler, but this case is specifically for pipeline latency monitoring and is quite low volume - one message per second, on one host.

@steffens @ruflin I'm seeing the same behaviour as @jhidalgo - filebeat works as expected for a while, but then suddenly stops detecting log changes. After that, it continues to scan the files, but decides: "Not harvesting, file didn't change".

I'm using current head from the filebeat repo, because I needed the force_close_on_windows_files option. OS is Windows Server 2012 R2.

The prospectors are configured like this:

filebeat:
  prospectors:
    -
      paths:
        - c:\server\server1.log
      type: log
      fields:
        type: foo
      scan_frequency: 3s
      tail_on_rotate: false
      force_close_windows_files: true  
    -
      paths:
        - c:\server\server2.log
      type: log
      fields:
        type: foo
      scan_frequency: 3s
      tail_on_rotate: false
      force_close_windows_files: true

  # General filebeat configuration options
  registry_file: "C:/ProgramData/Filebeat/registry"

@steffens

Setting spool_size to 1 immediately fixed my issue. Max latency at the first node has dropped from ~8 seconds to 1 second, and min latency is ~0.05 secs, which is about the same as I was seeing before testing filebeat.

With spool_size at 1, I don't see much point in changing idle_timeout, as the output will fire as soon as an event is available.

Thanks very much!

@lhm @jhidalgo I think I found the same problem. Take a look at this PR.

https://github.com/elastic/filebeat/pull/199