Filebeat stops updating the registry

UPDATE: This discussion is being tracked as bug #3098

UPDATE: Adding the affected OS as CentOS 6.x (2.6.32-642.3.1.el6.x86_64)

Per the title, Filebeat (5.0.0) appears to stop updating the registry, breaking state durability. I'll provide some additional context below, as well as any configuration/logs/whatever that might be helpful for the development team.

Note: I am aware of this thread, which highlights a similar issue. I chose to create a new thread because I didn't want to complicate that one any further.

The background:

General

  • The issue appears on CentOS 6.x hosts (kernel: 2.6.32-642.3.1.el6.x86_64)
  • We recently upgraded to Filebeat 5.0.0 (from 1.3.1) on a limited subset of production hosts.
  • We largely maintained our previous configuration, with the addition of some clean_* flags.
  • We ship through a Logstash cluster (logstash @ 2.2.1, logstash-input-beats @ 2.1.3)
  • We index into Elasticsearch (2.2.2)
  • On peak days, we push ~1B documents per day, with instantaneous peaks of ~60K documents a second.

Host Specific

  • The example host referenced below is primarily used to run Mesos.
  • The process spits out a lot (hundreds per day) of relatively small (a few MBs) logs, each with a unique name.
  • The filebeat prospectors use fairly broad globs to find the files, all of which were in place prior to restarting the Filebeat process after the upgrade.
  • The registry file was removed as part of the upgrade, so Filebeat was given a fresh start.

The symptoms:

For some period of time, all appears to be well: Filebeat is shipping, and the registry is updating. At some point, metrics related to registry updates stop appearing in the logs, although log lines continue to ship. Some time afterwards, the registry file itself stops being updated (based on its modification timestamp). I have not directly confirmed it, but I strongly suspect a restart would cause those lines to be shipped again.

Timeline:

  • 2016-11-29 11:12:16-05:00 : Filebeat process is restarted
  • 2016-11-29 11:17:16-05:00 : Filebeat registry metrics last seen in logs
  • 2016-11-29 11:34:00-05:00 : Filebeat registry file last updated on file system
  • 2016-11-29 16:35:46-05:00 : Filebeat continues to ship log lines (as seen in filebeat logs, elasticsearch)

The configuration:

/etc/filebeat/filebeat.yml
---
filebeat:
  config_dir: /etc/filebeat/conf.d
  idle_timeout: 5s
  publish_async: true
  registry_file: /var/lib/filebeat/registry
  spool_size: 2000
logging:
  files:
    keepfiles: 5
    name: filebeat.log
    path: /var/log/filebeat
    rotateeverybytes: 10485760
  level: info
  to_files: true
  to_syslog: false
output:
  logstash:
    bulk_max_size: 2000
    compression_level: 3
    hosts: ['xxx-xxx-006','xxx-xxx-001','xxx-xxx-002','xxx-xxx-003','xxx-xxx-004','xxx-xxx-005']
    loadbalance: true
    max_retries: -1
    worker: 4

/etc/filebeat/conf.d/mesos-task-json.yml
---
filebeat:
  prospectors:
    -
      document_type: service-json
      paths:
        - '/mesos_work_dir/slaves/*/frameworks/*/executors/*/runs/latest/service-logstash.log'
        - '/var/lib/mesos/slaves/*/frameworks/*/executors/*/runs/latest/service-logstash.log'
      ignore_older: 24h
      clean_inactive: 48h
      scan_frequency: 30s

Other (Possible) Notables

  • The configuration uses both async publishing, and load balancing. In the other, related thread, there was some speculation that this might be a 'bad' combination. While I do think it would be safe to pull back on async publishing across most of our log sources, we have some point sources that absolutely require it to maintain sufficient throughput.
  • The registry, when last written, was well over 6MB, tracking hundreds of files. I can make this available to you, if it would be helpful.
  • The logs continue to show basic activity continuing, just no registry updates. I could also make these available, if helpful.
  • Spot checks show that every machine in this class has exhibited the same behavior, with nearly identical timing.
  • Our staging environment does not exhibit this behavior. The differences? Smaller logstash cluster (2), fewer workers, no load balancing, and no async publishing.
  • Our logs show a large number of the "File is falling under ignore_older before harvesting is finished. Adjust your close_* settings" messages. These are presumably for pre-existing files that were already beyond the ignore window. This would seem like a bug to me (related?), but I could also be misinterpreting something.
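For anyone hitting the same "falling under ignore_older" warning, a prospector sketch like the following may help (values are illustrative, not our production settings): pairing ignore_older with an explicit close_inactive closes harvesters well before files fall out of the ignore window, and clean_inactive stays comfortably above ignore_older plus scan_frequency.

```yaml
filebeat:
  prospectors:
    -
      paths:
        - /var/log/app/*.log
      # Close the harvester once a file has been unchanged for 1h,
      # well before ignore_older (24h) takes effect.
      close_inactive: 1h
      ignore_older: 24h
      # clean_inactive must be greater than ignore_older + scan_frequency.
      clean_inactive: 48h
      scan_frequency: 30s
```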

Any help is appreciated. We are very much hoping to complete the Filebeat upgrade to take advantage of the updated clean_* / close_* features. The lack of these features in previous versions caused us quite a bit of grief, and we are anxious to put that behind us.


I fully second this request. As I authored several other tickets about registry problems I can state that:

  • keeping loadbalance: true but disabling publish_async helps
  • this combo is only problematic in very recent versions of Elastic Stack; even early 5.0 alphas of FB and LS weren't affected
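For reference, the change that worked for me looks roughly like this (a sketch against the config posted above, not a verified fix; hosts are placeholders):

```yaml
filebeat:
  config_dir: /etc/filebeat/conf.d
  idle_timeout: 5s
  # publish_async: true   # removed -- suspected bad interaction with loadbalance
  registry_file: /var/lib/filebeat/registry
  spool_size: 2000
output:
  logstash:
    hosts: ['xxx-xxx-001','xxx-xxx-002']
    loadbalance: true     # load balancing on its own appears to be fine
    worker: 4
```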

In the vast majority of cases, I would be able to disable async publishing. However...

  1. There are a handful of (important!) hosts where throughput testing shows that async publishing is necessary in order to keep up with logging rates. Disabling async would create significant indexing lag.

  2. The async + load balancing combination is presented here as a supported option, assuming you can manage the stated caveats. If that combination is broken in the current 5.0.x releases, I imagine a growing number of people are going to run into it.

@ruflin @steffens You have mentioned some concerns about the behavior of filebeat when using load balance + async publish in current 5.0.x releases (i.e. 5.0.0, 5.0.1). Do you recommend that we avoid that combination for now? Is there any additional information that might help you make the determination?

I have created a gist with the filebeat log, and current registry.

@tom.dewire Thanks a lot for all the details. The current short answer is to disable publish_async and the problem should go away.

We are not sure yet what the actual cause of the problem is. It could be related to changes we made on the Filebeat side, but also on the beats-input-plugin. Which version of the plugin are you using? Do you know whether, around the time the problem starts, some of the Logstash instances were unavailable, or whether there were any potential network issues?

The current goal is to find a reliable way to reproduce it so we can dig deeper into the problem and fix it. @tom.dewire @shog Could one of you open a GitHub issue for the problem? I think we can now say definitively that there is a bug somewhere, but we don't yet know exactly where.

@tom.dewire I would expect you to see a higher number of events in the 5.0 release than in 1.x, potentially even with async disabled.

Thanks for the follow-up, and for the definitive statement around publish_async. I'll proceed with that in mind.

The relevant versions for logstash and the beats plugin are in my initial post. If there is any additional advice relevant to those versions, please let me know. It might be worth mentioning that we are planning on upgrading the entire pipeline in the near future. The filebeat upgrade was the first step in that process.

I would be happy to create the github issue.

The related issue is #3098.

@ruflin Regarding your question about any environmental (e.g. network) issues around the time of the incident... I have no evidence of such an event. General network issues would have manifested themselves elsewhere in our application and monitoring, and there was no sign of trouble during the relevant window.

Thanks for the issue. We will further investigate and ping you guys in case we need more details or found a solution.
