Hi,
We have an ELK setup to which we recently introduced Filebeat. As part of our testing, we stopped the Logstash instances that Filebeat writes to for a while to build up a backlog of data, to confirm that this caused no issues and that we could clear the backlog without any problems.
On one type of host, we rotate log files at 100MB in size. We keep 12 files before the oldest is removed, and it usually takes around 6 hours for a file to be deleted. We stopped our Logstash instances for about 5 hours and then turned them back on. What happened next caused a bit of a problem on these hosts.
From the logs:
2017-06-06T16:58:31+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T16:58:49+01:00 ERR Connect failed with: dial tcp 172.18.2.23:5044: getsockopt: connection refused
2017-06-06T16:59:01+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T16:59:29+01:00 ERR Connect failed with: dial tcp 172.18.2.36:5044: getsockopt: connection refused
2017-06-06T16:59:31+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T16:59:49+01:00 ERR Connect failed with: dial tcp 172.18.2.23:5044: getsockopt: connection refused
2017-06-06T17:00:01+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T17:00:31+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=10609
2017-06-06T17:00:35+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue-console.log
2017-06-06T17:00:35+01:00 INFO File is falling under ignore_older before harvesting is finished. Adjust your close_* settings: /var/log/Application/production-Application-blue/structured-2017-06-06.21.json
2017-06-06T17:00:35+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.22.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.32.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.33.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.25.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.28.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.24.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.27.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.26.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.34.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.30.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.31.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.29.json
2017-06-06T17:01:01+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=80979 filebeat.harvester.open_files=11 filebeat.harvester.running=13 publish.events=296 libbeat.logstash.published_and_acked_events=296 registrar.writes=3 filebeat.harvester.started=13 libbeat.logstash.call_count.PublishEvents=3 libbeat.publisher.published_events=2246 libbeat.logstash.publish.read_bytes=36 registrar.states.cleanup=9 registar.states.current=-9 registrar.states.update=296
So upon connecting, Filebeat started harvesters for all the files, as I think is expected. At this point every other VM was also backlogged by 5 hours, so there were quite a lot of logs to harvest across the whole estate (600+ VMs, I think).
Soon after, Filebeat started logging messages like this:
2017-06-06T17:03:30+01:00 INFO File is falling under ignore_older before harvesting is finished. Adjust your close_* settings: /var/log/Application/production-Application-blue/structured-2017-06-06.24.json
This is intentional: we concluded that it's fine to start ignoring log files after 4 hours if for some reason they can't be processed. We believed (from reading the documentation) that this setting, along with close_removed, would be enough to stop Filebeat from hanging onto these files in exactly the sort of scenario we had just created.
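For reference, the relevant part of our prospector config looks roughly like this (the path glob is reconstructed from the log lines above, and the exact layout is illustrative rather than copied verbatim from our real config):

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/Application/production-Application-blue/structured-*.json
    # Stop harvesting files whose last modification is more than 4h old.
    ignore_older: 4h
    # Close the harvester as soon as the underlying file is deleted.
    # (As far as I can tell this is also the default in 5.x, but we set
    # it explicitly.)
    close_removed: true
```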
However, that doesn't appear to be the case. Over the following 12 hours, Filebeat held the 12 or so files mentioned above open and continued harvesting them, despite the files having been deleted as part of our log rotation process.
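(If anyone wants to check for the same behaviour, the deleted-but-still-open files are visible with standard lsof; `+L1` selects open files whose link count is zero, i.e. files that have been unlinked:)

```sh
# List files the filebeat process still holds open after deletion.
sudo lsof +L1 -c filebeat
```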
So how is close_removed supposed to work?
The docs specifically say:
"When this option is enabled, Filebeat closes the harvester when a file is removed."
We're currently running Filebeat 5.2.2, and I've looked at the release notes for every version up to 6.0.0-alpha2 to see if there was any mention of an issue with close_removed being fixed, but I can't find anything even remotely similar. This has made me wonder whether I'm actually misunderstanding the documentation.
Any help is greatly appreciated.
Thanks