Filebeat registry not working for filestream input

I noticed that some of the Filebeat instances in our production environment resend log files, and I suspected the registry as the root cause.

Trying to dig into it, I got really confused about the registry; maybe someone can help me with this.

What I did:

On a test system running Ubuntu 22.04, I

  • uninstalled Filebeat (apt purge plus removal of the folders /usr/share/filebeat, /etc/filebeat, /var/lib/filebeat)
  • installed Filebeat 8.18.0 and set up a single input
  • checked in Kibana for logs and verified that a registry file exists
  • restarted Filebeat to check for duplicates in Kibana

I did this twice: the first time with a Filebeat input of type “filestream”, the second time with an input of type “log”. (And I actually repeated both cases once more to verify my findings.)

  • In both cases, there is a /var/lib/filebeat/filebeat/log.json with entries that could be a registry.
  • In the “filestream” case, logs were resent on every Filebeat restart, so all events got duplicated in Kibana.
  • In the “log” case, there were no duplicates.

I am confused about the registry file name; I expected something like 1234567.json, as that is what I find on all the production servers.

root@productive.system.somedomain:/var/lib/filebeat/filebeat# ls -la
total 10692
drwxr-x--- 2 root root    4096 Nov 26 10:24 .
drwxr-x--- 3 root root    4096 Nov 26 11:01 ..
-rw------- 1 root root 1500536 Nov 26 10:24 1388703.json
-rw------- 1 root root      39 Nov 26 10:24 active.dat
-rw------- 1 root root 9421401 Nov 28 12:51 log.json
-rw------- 1 root root      15 Dec 16  2024 meta.json

The format of 1388703.json and log.json is different! But I find hints in the Filebeat logs that log.json is a registry:

Nov 28 13:10:24 test.system.somedomain filebeat[3352563]: {"log.level":"info","@timestamp":"2025-11-28T13:10:24.947+0100","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.openStore","file.name":"memlog/store.go","file.line":134},"message":"Finished loading transaction log file for '/var/lib/filebeat/filebeat'. Active transaction id=8","service.name":"filebeat","ecs.version":"1.6.0"}

These log entries exist in both test cases, yet in the “filestream” case everything is resent.
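
If it helps, the entries in log.json look roughly like the following: each state update is an {"op": ...} line followed by a key/value line, and I assume the “Active transaction id” from the log message above refers to the id of the last such op line. The ids, inodes, and offsets here are invented for illustration, so don't take the exact fields literally:

# schematic state entry as written by a "log" input (values invented)
{"op":"set","id":7}
{"k":"filebeat::logs::native::271010-64768","v":{"source":"/var/log/logstash/logstash-plain.log","offset":52871,"ttl":-1,"identifier_name":"native"}}
# schematic state entry as written by a "filestream" input (values invented)
{"op":"set","id":8}
{"k":"filestream::logstash::native::271010-64768","v":{"cursor":{"offset":52871},"meta":{"source":"/var/log/logstash/logstash-plain.log","identifier_name":"native"},"ttl":-1}}

So both input types seem to write into the same transaction log, just with different key prefixes and value layouts.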

My questions:

  • Is /var/lib/filebeat/filebeat/log.json the registry? If yes, when (in which version) was the /var/lib/filebeat/filebeat/1234567.json style removed? If no, how does it work instead?
  • Why are “filestream” logs resent, while for “log” Filebeat keeps track of its offset?
  • How can I make it work for both types of input?
  • Is this related to the file_identity, even though I use version 8.18? (How to choose file identity for filestream | Beats; see the sketch below this list.)
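
Regarding the last question: from the linked doc I understand that the identity used by filestream can be pinned explicitly in the input. This is what I would try next (untested sketch, option names taken from that doc):

- type: filestream
  id: logstash
  paths:
  - /var/log/logstash/logstash-plain.log
  # pin the pre-9.0 default identity explicitly (sketch, untested)
  file_identity.native: ~
  prospector.scanner.fingerprint.enabled: false

But as far as I can tell, 8.18 still defaults to the native identity anyway, so I don't see why this should make a difference.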

Thank you very much for your help!

This is the filebeat.yml (registry.path points at /var/lib/filebeat, which I assume is why the state files above end up in /var/lib/filebeat/filebeat):

name: test.system.somedomain
tags: []
fields_under_root: true
filebeat:
  config.inputs:
    enabled: true
    path: "/etc/filebeat/conf.d/*.yml"
  config.modules:
    enabled: false
    path: "/etc/filebeat/modules.d/*.yml"
  modules: []
  overwrite_pipelines: false
  shutdown_timeout: '0'
  registry:
    path: "/var/lib/filebeat"
    file_permissions: '0600'
    flush: 0s
  autodiscover: {}
http: {}
cloud: {}
queue: {}
output:
  logstash:
    ssl:
      enabled: true
      verification_mode: full
      certificate_authorities: "/etc/ssl/custom/ca.crt"
    hosts:
    - test.system.somedomain:5055
    loadbalance: true
shipper: {}
logging: {}

This is the “log” input:

---
- type: log
  paths:
  - /var/log/logstash/logstash-plain.log
  encoding: plain
  document_type: log
  scan_frequency: 10s
  harvester_buffer_size: 16384
  max_bytes: 10485760
  multiline:
    pattern: '^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\]'
    negate: true
    match: after
    max_lines: 10000
    timeout: 15s
  tail_files: false

  # Experimental: If symlinks is enabled, symlinks are opened and harvested. The harvester opens the
  # original file for harvesting but reports the symlink name as the source.
  #symlinks: false

  backoff: 1s
  max_backoff: 10s
  backoff_factor: 2

  # Experimental: Max number of harvesters that are started in parallel.
  # Default is 0 which means unlimited

  ### Harvester closing options

  # Close inactive closes the file handler after the predefined period.
  # The period starts when the last line of the file was read, not from the file's ModTime.
  # Time strings like 2h (2 hours), 5m (5 minutes) can be used.
  close_inactive: 5m

  # Close renamed closes a file handler when the file is renamed or rotated.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_renamed: false

  # When this option is enabled, a file handler is closed immediately when a file can no longer
  # be found. If the file shows up again later, harvesting will continue at the last known position
  # after scan_frequency.
  close_removed: true

  # Closes the file handler as soon as the harvester reaches the end of the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_eof: false

  ### State options

  # If a file's modification time is older than clean_inactive, its state is removed from the registry.
  # By default this is disabled.
  clean_inactive: 0

  # Immediately removes the state for files which can no longer be found on disk.
  clean_removed: true

  # Close timeout closes the harvester after the predefined time.
  # This happens regardless of whether the harvester has finished reading the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_timeout: 0
  fields:
    type: logstash
  fields_under_root: true

This is the “filestream” input:

---
- type: filestream
  id: logstash
  paths:
  - /var/log/logstash/logstash-plain.log
  encoding: plain
  document_type: log
  prospector:
     scanner:
        check_interval: 10s
  harvester_buffer_size: 16384
  message_max_bytes: 10485760
  parsers:
    - multiline:
        pattern: '^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\]'
        negate: true
        match: after
        max_lines: 10000
        timeout: 15s
  tail_files: false

  # Experimental: If symlinks is enabled, symlinks are opened and harvested. The harvester opens the
  # original file for harvesting but reports the symlink name as the source.
  #symlinks: false

  backoff.init: 1s
  backoff.max: 10s
  backoff_factor: 2

  # Experimental: Max number of harvesters that are started in parallel.
  # Default is 0 which means unlimited

  ### Harvester closing options

  # Close inactive closes the file handler after the predefined period.
  # The period starts when the last line of the file was read, not from the file's ModTime.
  # Time strings like 2h (2 hours), 5m (5 minutes) can be used.
  close_inactive: 5m

  # Close renamed closes a file handler when the file is renamed or rotated.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_renamed: false

  # When this option is enabled, a file handler is closed immediately when a file can no longer
  # be found. If the file shows up again later, harvesting will continue at the last known position
  # after scan_frequency.
  close_removed: true

  # Closes the file handler as soon as the harvester reaches the end of the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_eof: false

  ### State options

  # If a file's modification time is older than clean_inactive, its state is removed from the registry.
  # By default this is disabled.
  clean_inactive: 0

  # Immediately removes the state for files which can no longer be found on disk.
  clean_removed: true

  # Close timeout closes the harvester after the predefined time.
  # This happens regardless of whether the harvester has finished reading the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_timeout: 0
  fields:
    type: logstash
  fields_under_root: true
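
One more thing I noticed while writing this up: several options in the filestream input above are carried over 1:1 from the old log input. If I read the filestream docs correctly, most of them are spelled differently there, roughly like this (sketch based on the docs, untested):

- type: filestream
  id: logstash
  paths:
  - /var/log/logstash/logstash-plain.log
  prospector.scanner.check_interval: 10s   # replaces scan_frequency
  close.on_state_change.inactive: 5m       # replaces close_inactive
  close.on_state_change.renamed: false     # replaces close_renamed
  close.on_state_change.removed: true      # replaces close_removed
  close.reader.on_eof: false               # replaces close_eof
  close.reader.after_interval: 0           # replaces close_timeout
  clean_inactive: 0
  clean_removed: true

Could leftover options like tail_files or close_inactive in the filestream input be the reason the state is not picked up again after a restart?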