Close_removed not working properly?

Hi,

We have an ELK setup that we recently introduced filebeat to. Part of our testing was to stop the logstash instances that filebeat writes to for a while to backlog some data, to confirm that it caused no issues and that we could clear the backlog without any problem.

On one type of our hosts, we rotate log files at 100MB in size. We keep 12 files before the oldest is removed, and it usually takes around 6 hours for a file to be deleted. We stopped our logstash instances for about 5 hours and then turned them back on. What happened next caused a bit of a problem on these hosts.

From the logs:
2017-06-06T16:58:31+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T16:58:49+01:00 ERR Connect failed with: dial tcp 172.18.2.23:5044: getsockopt: connection refused
2017-06-06T16:59:01+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T16:59:29+01:00 ERR Connect failed with: dial tcp 172.18.2.36:5044: getsockopt: connection refused
2017-06-06T16:59:31+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T16:59:49+01:00 ERR Connect failed with: dial tcp 172.18.2.23:5044: getsockopt: connection refused
2017-06-06T17:00:01+01:00 INFO No non-zero metrics in the last 30s
2017-06-06T17:00:31+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=10609
2017-06-06T17:00:35+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue-console.log
2017-06-06T17:00:35+01:00 INFO File is falling under ignore_older before harvesting is finished. Adjust your close_* settings: /var/log/Application/production-Application-blue/structured-2017-06-06.21.json
2017-06-06T17:00:35+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.22.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.32.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.33.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.25.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.28.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.24.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.27.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.26.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.34.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.30.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.31.json
2017-06-06T17:00:50+01:00 INFO Harvester started for file: /var/log/Application/production-Application-blue/structured-2017-06-06.29.json
2017-06-06T17:01:01+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=80979 filebeat.harvester.open_files=11 filebeat.harvester.running=13 publish.events=296 libbeat.logstash.published_and_acked_events=296 registrar.writes=3 filebeat.harvester.started=13 libbeat.logstash.call_count.PublishEvents=3 libbeat.publisher.published_events=2246 libbeat.logstash.publish.read_bytes=36 registrar.states.cleanup=9 registar.states.current=-9 registrar.states.update=296

So upon connecting, it started harvesters for all the files, as I think is expected. At this point every other VM was also backlogged by 5 hours, so there were quite a lot of logs to harvest over the whole estate (600+ VMs, I think).

Soon after, filebeat started logging this sort of thing:
2017-06-06T17:03:30+01:00 INFO File is falling under ignore_older before harvesting is finished. Adjust your close_* settings: /var/log/Application/production-Application-blue/structured-2017-06-06.24.json

This is intentional; we concluded that it's totally fine to start ignoring log files after 4 hours if for some reason they can't be processed. We believed (from reading the documentation) that this setting, along with close_removed, would be enough to stop filebeat hanging onto these files in the sort of scenario we had just created.

However, that doesn't appear to be the case. Over the following 12 hours, filebeat held the 12 or so files mentioned above open and continued harvesting them, despite the files having been deleted as part of our log rotation process.

So how is close_removed supposed to work?

The docs specifically say:
When this option is enabled, Filebeat closes the harvester when a file is removed.

We're currently running filebeat 5.2.2, and I looked at the release notes of everything up to 6.0.0-alpha2 to see if there was any mention of an issue with close_removed having been fixed, but I can't find anything even remotely similar. That has made me wonder whether I'm actually misunderstanding the documentation.

Any help is greatly appreciated.

Thanks

Can you share your full config file?

When the output is blocking, it becomes a little bit more complex. In general your assumption is correct: the file should be closed as soon as filebeat can send one more event, because while trying to fetch the next line it will detect that the log file was actually removed and will not continue. It will not close as long as the output is blocked. If you want to force filebeat to free the file handler, you need to use close_timeout. Could it be that filebeat was so busy reading all the new files that it didn't catch up on all the old files? Did it ever send new events from the files which were kept open?
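For illustration, a minimal sketch of what using close_timeout in a prospector might look like (the paths glob and the 30m value are placeholders, not settings taken from this thread):

filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/Application/*/structured-*.json
    ignore_older: 4h
    close_removed: true
    # Illustrative: release the harvester after at most 30 minutes,
    # even if the file has not been fully read yet.
    close_timeout: 30m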

Sure.

/etc/filebeat/filebeat.yml:

filebeat.spool_size: 2048
filebeat.publish_async: false
filebeat.idle_timeout: 5s
filebeat.registry_file: .filebeat
filebeat.config_dir: /etc/filebeat/conf.d
filebeat.shutdown_timeout: 0
name: production-app-001.domain.tld
fields_under_root: false
queue_size: 1000
max_procs: 
output.logstash:
  hosts:
    - logstash-receiver-001.domain.tld:5044
    - logstash-receiver-002.domain.tld:5044
  loadbalance: true

Then /etc/filebeat/conf.d/app.yml:
---
filebeat:
  prospectors:
    - input_type: log
      paths:
        - /var/log/Application/*/structured-*.json
      encoding: plain
      fields:
        '@application': Application
        '@environment': production
        '@group': blue
      fields_under_root: true
      tags:
        - app
      ignore_older: 4h
      document_type: app_logger
      scan_frequency: 10s
      harvester_buffer_size: 16384
      max_bytes: 10485760
      tail_files: false
      backoff: 1s
      max_backoff: 10s
      backoff_factor: 2
      close_inactive: 5m
      close_renamed: false
      close_removed: true
      close_eof: false
      clean_inactive: 4h1m
      clean_removed: true
      close_timeout: 0

Is there a way that I can tell whether the output was blocked at this time? I don't remember seeing anything in the filebeat log itself; otherwise I would hope that I would have included it.

I would guess it was the case to some extent. Our logstash instances that were receiving messages from filebeat had been down for almost 5 hours... upon starting them up again, there were 600-800 servers and VMs that started sending their logs in again.

I did see events from this file. As mentioned originally, I definitely saw entries from the one specific log file I mentioned about 12 hours late, so 7 hours or so after the logstash instances were started again. I'm unsure whether I can check if there were any events in between those two times, or how much of that specific file made it into the logging system, as I think the original file and the data in elasticsearch have now been removed; we only keep some data for 14 days.

The documentation for close_timeout suggests that it has the same problem you're describing for close_removed?

The close_timeout setting won’t apply if your output is stalled and no further events can be sent. At least one event must be sent after close_timeout elapses so the harvester can be closed after sending the event.

So even if I used that, assuming that the output is blocking, I would still be in the same position, wouldn't I?

For close_timeout: Since 5.3 it also works on blocked outputs. Unfortunately you are right and the docs here are not up-to-date :frowning: @dedemorton FYI
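As a sketch only (the 1h value is illustrative, not a recommendation from this thread): after upgrading to 5.3 or later, replacing the close_timeout: 0 in the prospector config above with a non-zero duration should free the harvesters even while the output is blocked:

      close_inactive: 5m
      close_renamed: false
      close_removed: true
      close_eof: false
      # Illustrative value: release each harvester after at most 1 hour,
      # which since 5.3 also applies when the output is blocked.
      close_timeout: 1h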

I assume there's no way that close_removed can be changed to work if an output is blocked too?

Not at the moment. With 6.0 there is a publisher refactoring which in the long term could potentially enable this.
