UPDATE: This discussion is being tracked as bug #3098
UPDATE: Adding the affected OS as CentOS 6.x (2.6.32-642.3.1.el6.x86_64)
Per the title, Filebeat (5.0.0) appears to stop updating the registry, breaking state durability. I'll provide some additional context below, as well as configuration, logs, and anything else that might be helpful for the development team.
Note: I am aware of this thread, which highlights a similar issue. I chose to create a new thread because I didn't want to complicate that one any further.
The background:
General
- The issue appears on CentOS 6.x hosts (kernel: 2.6.32-642.3.1.el6.x86_64)
- We recently upgraded to Filebeat 5.0.0 (from 1.3.1) on a limited subset of production hosts.
- We largely maintained our previous configuration, with the addition of some clean_* flags.
- We ship through a Logstash cluster (logstash @ 2.2.1, logstash-input-beats @ 2.1.3)
- We index into Elasticsearch (2.2.2)
- On peak days, we push ~1B documents per day, with instantaneous peaks of ~60K documents a second.
Host Specific
- The example host referenced below is primarily used to run Mesos.
- The process produces a lot of relatively small logs (hundreds per day, a few MB each), each with a unique name.
- The Filebeat prospectors use fairly broad globs to find the files; all of the matched files were already in place before the Filebeat process was restarted following the upgrade.
- The registry file was removed as part of the upgrade, so Filebeat was given a fresh start.
The symptoms:
For some period of time, all appears to be well -- Filebeat is shipping, and the registry is updating. At some point, metrics related to registry updates stop appearing in the logs, although log lines appear to continue shipping. Somewhat later, the registry file itself stops being updated (based on its modification timestamp). I have not directly confirmed it, but I strongly suspect a restart would cause the lines shipped after that point to be shipped again, as duplicates.
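For reference, we judge staleness both by the registry file's modification time and by the entries themselves. A 5.x registry entry looks roughly like the following (abridged from our file; the path, inode/device numbers, offset, and timestamp here are illustrative, not actual values -- the ttl corresponds to our clean_inactive of 48h):

[
  {
    "source": "/var/lib/mesos/slaves/.../runs/latest/service-logstash.log",
    "offset": 1048576,
    "FileStateOS": {"inode": 395301, "device": 2049},
    "timestamp": "2016-11-29T11:34:00-05:00",
    "ttl": 172800000000000
  }
]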
Timeline:
- 2016-11-29 11:12:16-05:00 : Filebeat process is restarted
- 2016-11-29 11:17:16-05:00 : Filebeat registry metrics last seen in logs
- 2016-11-29 11:34:00-05:00 : Filebeat registry file last updated on file system
- 2016-11-29 16:35:46-05:00 : Filebeat continues to ship log lines (as seen in the Filebeat logs and in Elasticsearch)
The configuration:
/etc/filebeat/filebeat.yml
---
filebeat:
  config_dir: /etc/filebeat/conf.d
  idle_timeout: 5s
  publish_async: true
  registry_file: /var/lib/filebeat/registry
  spool_size: 2000
logging:
  files:
    keepfiles: 5
    name: filebeat.log
    path: /var/log/filebeat
    rotateeverybytes: 10485760
  level: info
  to_files: true
  to_syslog: false
output:
  logstash:
    bulk_max_size: 2000
    compression_level: 3
    hosts: ['xxx-xxx-006','xxx-xxx-001','xxx-xxx-002','xxx-xxx-003','xxx-xxx-004','xxx-xxx-005']
    loadbalance: true
    max_retries: -1
    worker: 4
/etc/filebeat/conf.d/mesos-task-json.yml
---
filebeat:
  prospectors:
    -
      document_type: service-json
      paths:
        - /mesos_work_dir/slaves/*/frameworks/*/executors/*/runs/latest/service-logstash.log
        - /var/lib/mesos/slaves/*/frameworks/*/executors/*/runs/latest/service-logstash.log
      ignore_older: 24h
      clean_inactive: 48h
      scan_frequency: 30s
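If it helps with reproduction: the first change we plan to test is disabling async publishing in the main config, since that is the largest delta from our (healthy) staging setup. This is only a minimal sketch of our intended test, not a confirmed fix -- just the relevant keys are shown, and everything else stays as above:

/etc/filebeat/filebeat.yml (planned test variant)
---
filebeat:
  publish_async: false  # staging runs without async publishing and does not show the bug
output:
  logstash:
    loadbalance: true   # unchanged for now; may be toggled in a follow-up test
    worker: 4           # unchanged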
Other (Possible) Notables
- The configuration uses both async publishing and load balancing. In the other, related thread, there was some speculation that this might be a 'bad' combination. While I do think it would be safe to pull back on async publishing across most of our log sources (the sketch above shows the variant we plan to test), we have some point sources that absolutely require it to maintain sufficient throughput.
- The registry, when last written, was well over 6MB, tracking hundreds of files. I can make this available to you, if it would be helpful.
- The logs show basic activity continuing, just no registry updates. I can also make these available, if helpful.
- Spot checks show that every machine in this class has exhibited the same behavior, with almost identical timing.
- Our staging environment does not exhibit this behavior. The differences? A smaller Logstash cluster (two nodes), fewer workers, no load balancing, and no async publishing.
- Our logs show a large number of the "File is falling under ignore_older before harvesting is finished. Adjust your close_* settings" messages. These are presumably for pre-existing files that were already beyond the ignore window when Filebeat started. That seems like a bug to me (possibly related?), but I could also be misinterpreting something. A sketch of the close_* tuning we are considering follows this list.
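For completeness, this is the prospector tuning we are considering in response to those close_* messages. It is only a sketch of our best guess, assuming the documented 5.x constraint that clean_inactive must be greater than ignore_older + scan_frequency; the close_inactive value is an assumption on our part, not a validated fix:

/etc/filebeat/conf.d/mesos-task-json.yml (candidate tuning, changed/added keys only)
---
filebeat:
  prospectors:
    -
      close_inactive: 10m  # assumed value: close idle handles well before ignore_older (24h)
      ignore_older: 24h    # unchanged
      clean_inactive: 48h  # unchanged; still > ignore_older + scan_frequency
      scan_frequency: 30s  # unchanged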
Any help is appreciated. We are very much hoping to complete the Filebeat upgrade to take advantage of the updated clean_* / close_* features. The lack of these features in previous versions caused us quite a bit of grief, and we are anxious to put that behind us.