Registry is Never Cleaned

Filebeat 6.2.2, Ubuntu 14.04

So we have a couple of servers that are basically never shipping logs. My theory is that at some point they accumulated so many log files that the registry grew huge, and now Filebeat never gets a real chance to clean it. Here is the config:

filebeat.prospectors:
  - type: log
    paths:
      - /vol/log/tsar/**.log
    close_inactive: 5m
    close_removed: true

filebeat.shutdown_timeout: 10s
filebeat.registry_flush: 30s

output.kafka:
  hosts: ["kafka-01.stg-ue1.zipaws.com:9092","kafka-02.stg-ue1.zipaws.com:9092","kafka-03.stg-ue1.zipaws.com:9092",]
  topic: unified-logs
  required_acks: 1
  compression: snappy
  client_id: 'filebeat.svc'
  keep_alive: 10m
  partition.round_robin:
    reachable_only: true
  codec.format:
    string: "%{[message]}"

logging.level: debug

To be concrete, here is how many files are in the registry:

$ sudo cat /vol/filebeat-kafka-raw/lib/registry | jq -r '.[].source' | wc -l
365998

And here is how many files actually exist:

$ ls /vol/log/tsar | wc -l
750
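
In other words, the vast majority of the registry entries point at files that no longer exist. To count the stale entries directly, here's a variation on the jq command above (a quick sketch; it assumes none of the paths contain whitespace):

$ sudo cat /vol/filebeat-kafka-raw/lib/registry | jq -r '.[].source' | while read -r f; do [ -e "$f" ] || echo "$f"; done | wc -l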

Here's a fun snippet from the debug logs:

2018-03-26T09:40:22.642-0700    DEBUG   [prospector]    file/state.go:82        New state added for /vol/log/tsar/zr-resume-parser-20180204032744.log
2018-03-26T09:40:22.643-0700    DEBUG   [registrar]     registrar/registrar.go:193      Registrar states cleaned up. Before: 365998, After: 365998
2018-03-26T09:40:22.643-0700    DEBUG   [registrar]     registrar/registrar.go:200      Processing 1 events
2018-03-26T09:40:22.656-0700    DEBUG   [registrar]     registrar/registrar.go:193      Registrar states cleaned up. Before: 365998, After: 365998
2018-03-26T09:40:22.656-0700    DEBUG   [registrar]     registrar/registrar.go:200      Processing 1 events
2018-03-26T09:40:22.657-0700    DEBUG   [prospector]    file/state.go:82        New state added for /vol/log/tsar/zr-resume-parser-20180204022052.log
2018-03-26T09:40:22.669-0700    DEBUG   [registrar]     registrar/registrar.go:193      Registrar states cleaned up. Before: 365998, After: 365998
2018-03-26T09:40:22.669-0700    DEBUG   [registrar]     registrar/registrar.go:200      Processing 1 events
2018-03-26T09:40:22.670-0700    DEBUG   [prospector]    file/state.go:82        New state added for /vol/log/tsar/zr-resume-parser-20180204033051.log
2018-03-26T09:40:22.682-0700    DEBUG   [prospector]    file/state.go:82        New state added for /vol/log/tsar/zr-resume-parser-20180204033221.log
2018-03-26T09:40:22.682-0700    DEBUG   [registrar]     registrar/registrar.go:193      Registrar states cleaned up. Before: 365998, After: 365998
2018-03-26T09:40:22.682-0700    DEBUG   [registrar]     registrar/registrar.go:200      Processing 1 events
2018-03-26T09:40:22.689-0700    DEBUG   [prospector]    file/state.go:82        New state added for /vol/log/tsar/zr-resume-parser-20180204022931.log
2018-03-26T09:40:22.703-0700    DEBUG   [registrar]     registrar/registrar.go:193      Registrar states cleaned up. Before: 365998, After: 365998
2018-03-26T09:40:22.703-0700    DEBUG   [registrar]     registrar/registrar.go:200      Processing 1 events
2018-03-26T09:40:22.710-0700    DEBUG   [prospector]    file/state.go:82        New state added for /vol/log/tsar/zr-resume-parser-20180204025349.log
2018-03-26T09:40:22.722-0700    DEBUG   [registrar]     registrar/registrar.go:193      Registrar states cleaned up. Before: 365998, After: 365998
2018-03-26T09:40:22.722-0700    DEBUG   [registrar]     registrar/registrar.go:200      Processing 1 events

My guess is that if Filebeat dropped registry entries for files that are missing, this would eventually make progress, but as it stands it takes about 24 minutes just to walk the registry once. (I didn't time it; I just did the math: 365,998 entries at the ~4 ms per event visible in the logs above is roughly 1,460 seconds.)

Is there something I'm missing in my config? I'm under the impression that clean_removed should handle this. Am I right that if I leave it alone long enough it will eventually catch up? (Note that I have left it alone for days on end.)

Yes, clean_removed should indeed take care of these files, but perhaps new files are being created faster than Filebeat checks for deletions, so the cleanup never catches up.
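
For reference, clean_removed is enabled by default in 6.x, so you shouldn't need to add anything, but you can make the cleanup behavior explicit in the prospector config. A sketch of what that could look like (the ignore_older/clean_inactive values are illustrative only, not recommendations; clean_inactive must be greater than ignore_older plus scan_frequency):

filebeat.prospectors:
  - type: log
    paths:
      - /vol/log/tsar/**.log
    close_inactive: 5m
    close_removed: true
    clean_removed: true   # enabled by default; shown here for clarity
    ignore_older: 48h     # illustrative value
    clean_inactive: 72h   # must exceed ignore_older + scan_frequency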

Do you see any line like "Remove state for file as file removed ..." in your logs?
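
If you're not sure, you can grep for it (assuming Filebeat is logging to the default file location; adjust the path to wherever your logs actually go):

$ grep "Remove state for file as file removed" /var/log/filebeat/filebeat*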

There haven't been new files in quite a while (we fixed the bug that allowed the huge number of files to be created in the first place).

I don't have that message in my logs, no.

Just to give you a feel for the velocity at which new files are created, here is the tail of ls -lhtr /vol/log/tsar:

-rw-r--r-- 1 www-data zr   51M Mar 22 19:52 locres-batch-20180322164614.log
-rw-r--r-- 1 www-data zr   51M Mar 23 01:40 locres-batch-20180322195214.log
-rw-r--r-- 1 www-data zr   51M Mar 23 09:58 locres-batch-20180323014051.log
-rw-r--r-- 1 www-data zr  5.4M Mar 23 10:52 title-20180322183948.log
-rw-r--r-- 1 www-data zr   12M Mar 23 10:52 deep-match-20180322183949.log
-rw-r--r-- 1 root     zr  5.4M Mar 23 10:52 classifier-20180322183950.log
-rw-r--r-- 1 zr       zr   26K Mar 23 10:52 parser-20180322183950.log
-rw-r--r-- 1 zr       zr   27K Mar 23 10:53 jobs-20180322183953.log
-rw-r--r-- 1 zr       zr  2.9M Mar 23 10:53 parser-20180323013956.log
-rw-r--r-- 1 zr       zr  7.9M Mar 23 10:53 persistence-20180323013955.log
-rw-r--r-- 1 zr       zr  475K Mar 23 10:53 simjobs-20180322183957.log
-rw-r--r-- 1 zr       zr  231K Mar 23 10:53 suggest-candidates-20180322183959.log
-rw-r--r-- 1 www-data zr   51M Mar 23 16:14 locres-batch-20180323095802.log
-rw-r--r-- 1 www-data zr   51M Mar 23 23:37 locres-batch-20180323161420.log
-rw-r--r-- 1 www-data zr   51M Mar 24 09:38 locres-batch-20180323233726.log
-rw-r--r-- 1 www-data zr   51M Mar 24 15:41 locres-batch-20180324093857.log
-rw-r--r-- 1 www-data zr   51M Mar 24 20:22 locres-batch-20180324154127.log
-rw-r--r-- 1 www-data zr   51M Mar 25 04:37 locres-batch-20180324202244.log
-rw-r--r-- 1 www-data zr   51M Mar 25 13:09 locres-batch-20180325043758.log
-rw-r--r-- 1 www-data zr   51M Mar 25 19:32 locres-batch-20180325130907.log
-rw-r--r-- 1 zr       zr  5.6M Mar 26 01:18 zrc-zr-2018022101150201.log
-rw-r--r-- 1 zr       zr   15K Mar 26 02:33 parser-20180326090012.log
-rw-r--r-- 1 root     zr  1.1M Mar 26 04:05 zrc-root-docker-20180221040501.log
-rw-r--r-- 1 root     zr  762K Mar 26 05:09 zrc-root-zr-tmpreaper-20180221050001.log
-rw-r--r-- 1 www-data zr   51M Mar 26 05:58 locres-batch-20180325193248.log
-rw-r--r-- 1 www-data zr   51M Mar 26 08:27 deep-match-20180323105257.log
-rw-r--r-- 1 zr       zr   32M Mar 26 09:33 persistence-20180323175303.log
-rw-r--r-- 1 www-data zr   51M Mar 26 09:47 locres-batch-20180326055824.log
-rw-r--r-- 1 zr       zr  134K Mar 26 09:52 parser-20180323105258.log
-rw-r--r-- 1 root     zr  9.7M Mar 26 10:17 zrc-root-20180221081701.log
-rw-r--r-- 1 root     zr   14M Mar 26 10:25 zrc-root-command-20180316155501.log
-rw-r--r-- 1 zr       zr  141K Mar 26 10:27 matched-jobs-20180323105301.log
-rw-r--r-- 1 zr       zr  928K Mar 26 10:29 suggest-candidates-20180323105307.log
-rw-r--r-- 1 root     zr   28M Mar 26 10:30 zrc-root-service-20180316155501.log
-rw-r--r-- 1 zr       zr  2.1M Mar 26 10:31 jobs-20180323105305.log
-rw-r--r-- 1 zr       zr  389K Mar 26 10:33 persistence-20180326163327.log
-rw-r--r-- 1 www-data zr   24M Mar 26 10:33 classify-title-20180323105255.log
-rw-r--r-- 1 www-data zr  2.0M Mar 26 10:33 deep-match-20180326082726.log
-rw-r--r-- 1 zr       zr   13M Mar 26 10:33 parser-20180323175303.log
-rw-r--r-- 1 root     zr   24M Mar 26 10:33 classifier-20180323105258.log
-rw-r--r-- 1 www-data zr   18M Mar 26 10:33 geoac-20180322111227.log
-rw-r--r-- 1 www-data zr   19M Mar 26 10:33 locres-20180322111230.log
-rw-r--r-- 1 www-data zr   11M Mar 26 10:33 locres-batch-20180326094722.log

There are improvements for handling large registries coming in upcoming releases.

Meanwhile, as a workaround, you can clean up the registry by hand when the problem arises:

  • Stop filebeat
  • Backup your registry
    cp registry registry.bak
  • Run the script you can find below
    python cleanup_registry.py registry.bak > registry
  • Start filebeat
import json
import os.path
import sys

# Load the registry backup passed as the first argument
with open(sys.argv[1]) as f:
    registry = json.load(f)

# Keep only the entries whose source file still exists on disk
new_registry = [entry for entry in registry if os.path.isfile(entry["source"])]

# Write the cleaned registry to stdout
print(json.dumps(new_registry))
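
Before restarting Filebeat, it's worth sanity-checking that the new registry parses and that the entry count looks plausible, e.g.:

$ jq length registry

This should now be close to the ~750 files that actually exist on disk.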

Alright, I can do that. Thanks for the idea. I'm going to delete half of the entries and see if it can recover on its own, and I'll report back here: if Filebeat can't handle registries above some number of entries, we would do well to know what that number is.
