We've discovered an issue where Filebeat 6.3.1 creates multiple registry entries for the same file when starting for the first time after updating from 6.3.0. It also seems to re-ship the logs from the affected files. The duplicate registry entries are mostly identical, with the small difference of the meta field (null vs {}).
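Schematically, the duplicated entries look something like this (the offset, timestamp, and FileStateOS values below are illustrative placeholders, not copied from a real registry):

```json
[
  {"source": "/vol/log/tsar/zookeeper-20180716172421.log", "offset": 52371, "timestamp": "2018-07-16T17:50:01.000-07:00", "ttl": -1, "type": "log", "meta": null, "FileStateOS": {"inode": 123456, "device": 2049}},
  {"source": "/vol/log/tsar/zookeeper-20180716172421.log", "offset": 52371, "timestamp": "2018-07-16T17:50:01.000-07:00", "ttl": -1, "type": "log", "meta": {}, "FileStateOS": {"inode": 123456, "device": 2049}}
]
```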
Right now we're planning on cleaning up the registries via script and moving on. I'm mostly wondering if there's something we could/should have done to prevent this? I checked the breaking changes doc and didn't see anything related.
Interesting background: of the 120 servers we have deployed Filebeat 6.3.1 to, this happened on only 9, and even on those it typically affected between 1 and 10 of the 20+ files being tailed. I assume it's a pretty rare race condition.
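Something like the following sketch (the registry path is just an example) can flag a registry that contains more than one entry for the same source file; this is not an official tool, just a quick check for duplicates:

```python
import json
from collections import Counter

# Example registry path; adjust to the Filebeat data directory in use.
registry_path = "/vol/filebeat-kafka-raw/lib/registry"

with open(registry_path) as f:
    registry = json.load(f)

# Count how many registry entries point at the same source file.
counts = Counter(state["source"] for state in registry)
duplicates = {source: n for source, n in counts.items() if n > 1}

if duplicates:
    print("duplicate registry entries found:")
    for source, n in sorted(duplicates.items()):
        print("  %dx %s" % (n, source))
else:
    print("no duplicate registry entries")
```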
Here's a log snippet from when this occurs. Filebeat claims to load 18 states from the registry, then proceeds to start harvesters for all 18 files, even though they were already fully up to date. Those harvesters start from offset 0 and re-ship each whole file.
2018-07-16T17:57:23.479-0700 INFO registrar/registrar.go:116 Loading registrar data from /vol/filebeat-kafka-raw/lib/registry
2018-07-16T17:57:23.479-0700 INFO registrar/registrar.go:127 States Loaded from registrar: 18
2018-07-16T17:57:23.479-0700 WARN beater/filebeat.go:354 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2018-07-16T17:57:23.479-0700 INFO crawler/crawler.go:48 Loading Inputs: 1
2018-07-16T17:57:23.479-0700 INFO [monitoring] log/log.go:97 Starting metrics logging every 30s
2018-07-16T17:57:23.481-0700 INFO log/input.go:113 Configured paths: [/vol/log/tsar/**.log]
2018-07-16T17:57:23.481-0700 INFO input/input.go:88 Starting input of type: log; ID: 11910744973074344246
2018-07-16T17:57:23.481-0700 INFO crawler/crawler.go:82 Loading and starting Inputs completed. Enabled inputs: 1
2018-07-16T17:57:23.482-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180712112148.log
2018-07-16T17:57:23.482-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180706183153.log
2018-07-16T17:57:23.482-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180715080223.log
2018-07-16T17:57:23.482-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180713214813.log
2018-07-16T17:57:23.482-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180711005710.log
2018-07-16T17:57:23.482-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180701010923.log
2018-07-16T17:57:23.483-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180708044613.log
2018-07-16T17:57:23.483-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180716172421.log
2018-07-16T17:57:23.484-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/twistd-20180629145322.log
2018-07-16T17:57:23.484-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zrc-root-20180629151701.log
2018-07-16T17:57:23.484-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zrc-root-service-20180630110001.log
2018-07-16T17:57:23.484-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zrc-root-zr-tmpreaper-20180630050003.log
2018-07-16T17:57:23.484-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zrc-zr-20180630010502.log
2018-07-16T17:57:23.484-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180702113513.log
2018-07-16T17:57:23.484-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180705081203.log
2018-07-16T17:57:23.485-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zrc-root-command-20180629144730.log
2018-07-16T17:57:23.485-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180703220300.log
2018-07-16T17:57:23.488-0700 INFO log/harvester.go:228 Harvester started for file: /vol/log/tsar/zookeeper-20180709144543.log
On a single host in dev I can see that we installed Filebeat 6.2.4 on May 15, then upgraded from 6.2.4 to 6.3.1 on the 16th, then downgraded back to 6.3.0 as jdon mentioned.
In prod I can see a host that we started on 6.2.1 on Feb 15, then upgraded to 6.2.2 on March 1, to 6.2.3 on April 12, to 6.2.4 on April 26, to 6.3.0 on June 21, and to 6.3.1 on July 17, then downgraded back to 6.3.0 as jdon mentioned.
Note that both of these hosts exhibited the issue.
Hi @tudor, thanks for the workaround! Unfortunately, we did try that as part of the upgrade process, and 6.3.1 still exhibited the bug. We also thought this would work, and were surprised when it didn't. Does Filebeat care about the formatting of the JSON, or something? This is what we did:
* The Python script, filebeat-registry-add-meta-field.py:
import json

registry_paths = [
    "/vol/filebeat-kafka-raw/lib/registry",
    "/vol/filebeat-kafka-jsonwrap/lib/registry",
]

for path in registry_paths:
    # Reset per registry file so entries from one Filebeat instance
    # don't leak into the other instance's registry.
    states = {}
    with open(path) as f:
        registry = json.load(f)
    for state in registry:
        state["meta"] = {}
        source = state["source"]
        # keep the larger offset (don't re-ship logs)
        if source in states:
            if states[source]["offset"] < state["offset"]:
                states[source] = state
        else:
            states[source] = state
    registry = list(states.values())
    with open(path, "w") as f:
        f.write(json.dumps(registry))
The script looks good, thanks. To clarify, the posted registry file (https://gist.github.com/frioux/afdaa41b9197d32d840cbcc7310bde40) doesn't contain any meta keys. Is it perhaps the version from after the downgrade, or after the cleanup? Could we have a registry file that shows the meta: null situation?
Thank you @jdonahue and @frioux for the help in debugging this! We have added code to recover from that situation, and we reviewed the code paths to make sure meta: null can't happen; even if it does, it is now handled the same as meta: {}. I'm still a bit worried that the workaround didn't work for you, but I think we took enough preventive measures that this won't happen in the future.
The fix will be included in 6.3.2, which we expect to release sometime next week.