Filebeat 6.3.1 update caused multiple entries in the registry for the same file, logs re-shipped


(Jeremy Donahue) #1

Hi there,

We've discovered an issue where Filebeat 6.3.1 creates multiple entries in the registry for the same file when it starts for the first time after an upgrade from 6.3.0. It also re-ships the logs from the affected file. The registry entries are nearly identical; the only difference is the meta field (null vs. {}):

$ jq '.[] | select(.source == "/path/to/logs/file.log")' registry
{
  "source": "/path/to/logs/file.log",
  "offset": 52428808,
  "timestamp": "2018-07-16T11:47:59.444034309-07:00",
  "ttl": -1,
  "type": "log",
  "meta": {},
  "FileStateOS": {
    "inode": 268767219,
    "device": 64512
  }
}
{
  "source": "/path/to/logs/file.log",
  "offset": 25287513,
  "timestamp": "2018-07-16T11:47:59.444085182-07:00",
  "ttl": -1,
  "type": "log",
  "meta": null,
  "FileStateOS": {
    "inode": 268767219,
    "device": 64512
  }
}

Note that Filebeat considers the first entry fully shipped and has completely abandoned the second. The device and inode are identical.
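For anyone auditing their own registries for this, a quick way to spot affected files is to group entries by their (device, inode) pair instead of by source path (a minimal sketch; find_duplicate_states is a hypothetical helper, not part of Filebeat):

```python
import json
from collections import defaultdict

def find_duplicate_states(registry_json):
    # Group registry entries by their (device, inode) pair; a healthy
    # registry should have exactly one state per pair.
    groups = defaultdict(list)
    for state in json.loads(registry_json):
        key = (state["FileStateOS"]["device"], state["FileStateOS"]["inode"])
        groups[key].append(state)
    # Return only the pairs that appear more than once.
    return {key: states for key, states in groups.items() if len(states) > 1}
```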

The Filebeat config:

filebeat.prospectors:
  - type: log
    paths:
      - /path/to/logs/**.log
    close_inactive: 5m
    close_removed: true

filebeat.shutdown_timeout: 10s
filebeat.registry_flush: 30s

output.kafka:
  <unrelated>

Right now we're planning on cleaning up the registries via script and moving on. I'm mostly wondering if there's something we could/should have done to prevent this? I checked the breaking changes doc and didn't see anything related.

Thanks,
Jeremy Donahue


(fREW Schmidt) #2

Interesting background: of the 120 servers we have deployed Filebeat 6.3.1 to, this happened on only 9 of them, and even on those it typically affected between 1 and 10 of the at least 20 files being tailed. I assume it's a fairly rare race condition.


(Jeremy Donahue) #3

Here's a log snippet from when this occurs. Filebeat claims to load 18 states from the registry, then proceeds to start harvesters for all 18 files (despite their having been fully shipped already). Those harvesters start from offset 0 and re-ship the whole file.

2018-07-16T17:57:23.479-0700    INFO    registrar/registrar.go:116      Loading registrar data from /vol/filebeat-kafka-raw/lib/registry
2018-07-16T17:57:23.479-0700    INFO    registrar/registrar.go:127      States Loaded from registrar: 18
2018-07-16T17:57:23.479-0700    WARN    beater/filebeat.go:354  Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2018-07-16T17:57:23.479-0700    INFO    crawler/crawler.go:48   Loading Inputs: 1
2018-07-16T17:57:23.479-0700    INFO    [monitoring]    log/log.go:97   Starting metrics logging every 30s
2018-07-16T17:57:23.481-0700    INFO    log/input.go:113        Configured paths: [/vol/log/tsar/**.log]
2018-07-16T17:57:23.481-0700    INFO    input/input.go:88       Starting input of type: log; ID: 11910744973074344246 
2018-07-16T17:57:23.481-0700    INFO    crawler/crawler.go:82   Loading and starting Inputs completed. Enabled inputs: 1
2018-07-16T17:57:23.482-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180712112148.log
2018-07-16T17:57:23.482-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180706183153.log
2018-07-16T17:57:23.482-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180715080223.log
2018-07-16T17:57:23.482-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180713214813.log
2018-07-16T17:57:23.482-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180711005710.log
2018-07-16T17:57:23.482-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180701010923.log
2018-07-16T17:57:23.483-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180708044613.log
2018-07-16T17:57:23.483-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180716172421.log
2018-07-16T17:57:23.484-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/twistd-20180629145322.log
2018-07-16T17:57:23.484-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zrc-root-20180629151701.log
2018-07-16T17:57:23.484-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zrc-root-service-20180630110001.log
2018-07-16T17:57:23.484-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zrc-root-zr-tmpreaper-20180630050003.log
2018-07-16T17:57:23.484-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zrc-zr-20180630010502.log
2018-07-16T17:57:23.484-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180702113513.log
2018-07-16T17:57:23.484-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180705081203.log
2018-07-16T17:57:23.485-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zrc-root-command-20180629144730.log
2018-07-16T17:57:23.485-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180703220300.log
2018-07-16T17:57:23.488-0700    INFO    log/harvester.go:228    Harvester started for file: /vol/log/tsar/zookeeper-20180709144543.log

(Jeremy Donahue) #4

Quick update just to say that because this is a persistent problem we are rolling back the 6.3.1 update and pinning to 6.3.0 until this is fixed.


(ruflin) #5

Filebeat should never allow more than one entry for the same inode/device combination in the registry, so this definitely sounds like a bug.

Can you share your 6.3.0 registry file?


(ruflin) #6

@jdonahue @frioux Could one of you open a bug report on Github with this?

Did you manage to find a reliable way to reproduce it or a common pattern you see for which files / servers it happens?

From which version did you upgrade to 6.3.0 before?


(fREW Schmidt) #7

Bug: https://github.com/elastic/beats/issues/7634

On a single host in dev I can see that we installed filebeat 6.2.4 on May 15, then upgraded from 6.2.4 to 6.3.1 on the 16th, then downgraded back to 6.3.0 as jdon mentioned.

In prod I can see a host that we started with 6.2.1 on Feb 15, then upgraded to 6.2.2 on March 1, then upgraded to 6.2.3 on April 12, 6.2.3 to 6.2.4 on April 26, 6.2.4 to 6.3.0 on June 21, and 6.3.0 to 6.3.1 on July 17. Then downgraded back to 6.3.0 as jdon mentioned.

Note that both of these hosts exhibited the issue.

Example registry: https://gist.github.com/frioux/afdaa41b9197d32d840cbcc7310bde40


(Tudor Golubenco) #8

Thank you everyone for the reports; a fix is currently in the works.

A workaround, for now, is to manually edit the registry file, and:

  • Replace meta: null with meta: {} everywhere
  • If there is no meta key, add meta: {}

Sorry for the trouble!
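Applied programmatically, the two edits above amount to forcing every state's meta field to be an object (a minimal sketch of the suggested registry edit, assuming Filebeat is stopped while the file is rewritten; normalize_meta is a hypothetical name):

```python
import json

def normalize_meta(registry_json):
    # Replace a null or missing "meta" with {}; leave real objects alone.
    states = json.loads(registry_json)
    for state in states:
        if not state.get("meta"):
            state["meta"] = {}
    return json.dumps(states)
```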


(Jeremy Donahue) #9

Hi @tudor, thanks for the workaround! Unfortunately, we did try that as part of the upgrade process, and 6.3.1 still exhibited the bug. This is what we did:

  1. Stop Filebeat
  2. $ python filebeat-registry-add-meta-field.py *
  3. $ DEBIAN_FRONTEND=noninteractive apt-get -o Dpkg::Options::="--force-confdef" install filebeat --upgrade -y --force-yes
  4. Start Filebeat

We also thought this would work, and were surprised when it didn't. Does Filebeat care about the formatting of the JSON, or something?

* The python script, filebeat-registry-add-meta-field.py:

import json

registry_paths = ["/vol/filebeat-kafka-raw/lib/registry", "/vol/filebeat-kafka-jsonwrap/lib/registry"]

for path in registry_paths:
    with open(path) as f:
        registry = json.load(f)

    # Deduplicate per registry file (a fresh dict each iteration, so
    # states from one registry don't leak into the other).
    states = {}
    for state in registry:
        state['meta'] = {}
        source = state['source']
        # keep the larger offset (don't re-ship logs)
        if source not in states or states[source]["offset"] < state["offset"]:
            states[source] = state

    with open(path, 'w') as f:
        json.dump(list(states.values()), f)

(Tudor Golubenco) #10

The script looks good, thanks. To clarify, the posted registry file (https://gist.github.com/frioux/afdaa41b9197d32d840cbcc7310bde40) doesn't contain any meta keys. Is it perhaps the version from after the downgrade, or after cleanup? Could we get a registry file that shows the meta: null situation?


(Jeremy Donahue) #11

@tudor Yes, frioux/afdaa4 is the original (unbroken) 6.3.0 version.

Here's a (different from above) registry file before the null situation: https://gist.github.com/jeremydonahue/60b911c40d02058b97e48804409abd34
The same registry file after the meta: null situation: https://gist.github.com/jeremydonahue/db52d6d4d17773cec5593703bb5e5d8b


(Tudor Golubenco) #12

Thank you @jdonahue and @frioux for the help in debugging this! We have added code to recover from that situation and reviewed the code paths to make sure meta: null can't happen; even if it does, it is now handled the same as meta: {}. I'm still a bit worried that the workaround didn't work for you, but I think we've taken enough preventive measures that this won't happen again in the future.

The fix is included in 6.3.2, which we expect to release sometime next week.


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.