I found that some of the Filebeat instances in our production environment resend log files, and I suspected the registry as the root cause.
While trying to dig into it I got really confused about the registry; maybe someone can help me with this.
What I did:
On a test system with Ubuntu 22.04 installed I
- uninstalled Filebeat (apt purge + removal of the folders /usr/share/filebeat, /etc/filebeat, /var/lib/filebeat)
- installed Filebeat 8.18.0 and set up one input
- checked for logs in Kibana and whether a registry file exists
- restarted Filebeat to check for duplicates in Kibana
I did this twice: the first time the input was of type “filestream”, the second time of type “log”. (And I actually repeated both cases to verify my findings.)
- In both cases there is a /var/lib/filebeat/filebeat/log.json with entries that look like a registry.
- In the “filestream” case, logs were resent on a Filebeat restart, so all events are duplicated in Kibana.
- In the “log” case there were no duplicates.
I am confused about the registry file name; I expected something like 1234567.json, as that is what I find on all the production servers.
root@productive.system.somedomain:/var/lib/filebeat/filebeat# ls -la
total 10692
drwxr-x--- 2 root root 4096 Nov 26 10:24 .
drwxr-x--- 3 root root 4096 Nov 26 11:01 ..
-rw------- 1 root root 1500536 Nov 26 10:24 1388703.json
-rw------- 1 root root 39 Nov 26 10:24 active.dat
-rw------- 1 root root 9421401 Nov 28 12:51 log.json
-rw------- 1 root root 15 Dec 16 2024 meta.json
The formats of 1388703.json and log.json differ! But I find hints in the Filebeat logs that log.json is a registry:
Nov 28 13:10:24 test.system.somedomain filebeat[3352563]: {"log.level":"info","@timestamp":"2025-11-28T13:10:24.947+0100","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.openStore","file.name":"memlog/store.go","file.line":134},"message":"Finished loading transaction log file for '/var/lib/filebeat/filebeat'. Active transaction id=8","service.name":"filebeat","ecs.version":"1.6.0"}
These log entries exist in both test cases, yet in the “filestream” case everything is resent.
My questions:
- Is /var/lib/filebeat/filebeat/log.json the registry? If yes, when was /var/lib/filebeat/filebeat/1234567.json removed? If not, how does it work instead?
- Why are “filestream” logs resent, while for “log” Filebeat keeps track of its offset?
- How can I make it work for both types of input?
- Is this related to file_identity, even though I use v8.18? (How to choose file identity for filestream | Beats) See my sketch below.
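For reference, this is how I understand the fingerprint file_identity from the linked docs would be configured for my input. This is an untested sketch based only on the documentation, not something I have verified:
- type: filestream
  id: logstash
  paths:
    - /var/log/logstash/logstash-plain.log
  # Untested sketch from the linked docs: identify files by a content
  # fingerprint instead of the default native (inode + device) identity.
  file_identity.fingerprint: ~
  prospector:
    scanner:
      # The scanner must also compute the fingerprint.
      fingerprint.enabled: true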
Thank you very much for your help!
This is the filebeat.yaml:
name: test.system.somedomain
tags: []
fields_under_root: true
filebeat:
  config.inputs:
    enabled: true
    path: "/etc/filebeat/conf.d/*.yml"
  config.modules:
    enabled: false
    path: "/etc/filebeat/modules.d/*.yml"
  modules: []
  overwrite_pipelines: false
  shutdown_timeout: '0'
  registry:
    path: "/var/lib/filebeat"
    file_permissions: '0600'
    flush: 0s
  autodiscover: {}
http: {}
cloud: {}
queue: {}
output:
  logstash:
    ssl:
      enabled: true
      verification_mode: full
      certificate_authorities: "/etc/ssl/custom/ca.crt"
    hosts:
      - test.system.somedomain:5055
    loadbalance: true
shipper: {}
logging: {}
This is the “log” input:
---
- type: log
  paths:
    - /var/log/logstash/logstash-plain.log
  encoding: plain
  document_type: log
  scan_frequency: 10s
  harvester_buffer_size: 16384
  max_bytes: 10485760
  multiline:
    pattern: '^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\]'
    negate: true
    match: after
    max_lines: 10000
    timeout: 15s
  tail_files: false
  # Experimental: If symlinks is enabled, symlinks are opened and harvested. The harvester opens the
  # original file for harvesting but reports the symlink name as source.
  #symlinks: false
  backoff: 1s
  max_backoff: 10s
  backoff_factor: 2
  # Experimental: Max number of harvesters that are started in parallel.
  # Default is 0, which means unlimited.
  ### Harvester closing options
  # Close inactive closes the file handler after the predefined period.
  # The period starts when the last line of the file was read, not from the file's ModTime.
  # Time strings like 2h (2 hours) or 5m (5 minutes) can be used.
  close_inactive: 5m
  # Close renamed closes a file handler when the file is renamed or rotated.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_renamed: false
  # When this option is enabled, a file handler is closed immediately if the file can no longer
  # be found. If the file shows up again later, harvesting continues at the last known position
  # after scan_frequency.
  close_removed: true
  # Closes the file handler as soon as the harvester reaches the end of the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_eof: false
  ### State options
  # If a file's modification time is older than clean_inactive, its state is removed from the registry.
  # By default this is disabled.
  clean_inactive: 0
  # Immediately removes the state of files which can no longer be found on disk.
  clean_removed: true
  # Close timeout closes the harvester after the predefined time,
  # regardless of whether the harvester has finished reading the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_timeout: 0
  fields:
    type: logstash
  fields_under_root: true
This is the “filestream” input:
---
- type: filestream
  id: logstash
  paths:
    - /var/log/logstash/logstash-plain.log
  encoding: plain
  document_type: log
  prospector:
    scanner:
      check_interval: 10s
  harvester_buffer_size: 16384
  message_max_bytes: 10485760
  parsers:
    - multiline:
        pattern: '^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\]'
        negate: true
        match: after
        max_lines: 10000
        timeout: 15s
  tail_files: false
  # Experimental: If symlinks is enabled, symlinks are opened and harvested. The harvester opens the
  # original file for harvesting but reports the symlink name as source.
  #symlinks: false
  backoff.init: 1s
  backoff.max: 10s
  backoff_factor: 2
  # Experimental: Max number of harvesters that are started in parallel.
  # Default is 0, which means unlimited.
  ### Harvester closing options
  # Close inactive closes the file handler after the predefined period.
  # The period starts when the last line of the file was read, not from the file's ModTime.
  # Time strings like 2h (2 hours) or 5m (5 minutes) can be used.
  close_inactive: 5m
  # Close renamed closes a file handler when the file is renamed or rotated.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_renamed: false
  # When this option is enabled, a file handler is closed immediately if the file can no longer
  # be found. If the file shows up again later, harvesting continues at the last known position
  # after scan_frequency.
  close_removed: true
  # Closes the file handler as soon as the harvester reaches the end of the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_eof: false
  ### State options
  # If a file's modification time is older than clean_inactive, its state is removed from the registry.
  # By default this is disabled.
  clean_inactive: 0
  # Immediately removes the state of files which can no longer be found on disk.
  clean_removed: true
  # Close timeout closes the harvester after the predefined time,
  # regardless of whether the harvester has finished reading the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_timeout: 0
  fields:
    type: logstash
  fields_under_root: true
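Regarding my third question: if I read the migration guide correctly, a filestream input can take over the states that a previous “log” input stored for the same paths, so nothing gets resent after switching input types. Again an untested sketch based only on the docs:
- type: filestream
  id: logstash
  # Untested sketch from the 8.x migration guide: take_over is supposed to let
  # this filestream input claim the registry states of a previous "log" input
  # for the same paths instead of re-reading the files from the beginning.
  take_over: true
  paths:
    - /var/log/logstash/logstash-plain.log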