Registry load gets caught in infinite loop

Hello,

I'm using Filebeat to process a big number of logfiles and I've got the feeling that the registry is not able to cope with the amount. The config I am using has worked the first time and the startup happens immediately when I delete the registry file before starting. For obvious reasons, I can't do this in production.

OS: Ubuntu 16.04
Filebeat version 6.3.2 (amd64), libbeat 6.3.2 [45a9a9e1561b6c540e94211ebe03d18abcacae55 built 2018-07-20 04:18:19 +0000 UTC]

The startup log is as follows:

2018-08-23T11:11:45.915Z INFO instance/beat.go:315 filebeat start running.
2018-08-23T11:11:45.915Z INFO registrar/registrar.go:117 Loading registrar data from /var/lib/filebeat/registry
2018-08-23T11:11:45.915Z INFO [monitoring] log/log.go:97 Starting metrics logging every 30s
2018-08-23T11:11:46.570Z INFO registrar/registrar.go:124 States Loaded from registrar: 100668
2018-08-23T11:11:46.570Z INFO crawler/crawler.go:48 Loading Inputs: 1

From that point, I'm only getting metrics printed out - I've been waiting for 30 minutes but the loading didn't finish.

I'm unable to reduce the number of individual files and I've also reduced the ignore_older and clean_inactive time.

My config looks as follows:

  paths:
   - /var/lib/jenkins/logs/*
   - /var/lib/jenkins/jobs/*/branches/*/builds/*/*.log
   - /var/lib/jenkins/jobs/*/builds/*/*.log
  symlinks: false
  exclude_files: [ 'lastStableBuild', 'lastSuccessfulBuild' ]
  close_inactive: 5s
  scan_frequency: 10s
  ignore_older: 24h
  clean_inactive: 25h

Thank you in advance.

I have used strace on one of the processes and noticed the following output:

$ strace -f -p 6550 -e file,write,close,open
[pid  6535] lstat("/var/lib/filebeat/registry", {st_mode=S_IFREG|0600, st_size=22832396, ...}) = 0
[pid  6535] renameat(AT_FDCWD, "/var/lib/filebeat/registry.new", AT_FDCWD, "/var/lib/filebeat/registry") = 0
[pid  6535] openat(AT_FDCWD, "/var/lib/filebeat", O_RDONLY|O_CLOEXEC) = 5
[pid  6535] close(5)                    = 0
[pid  6535] openat(AT_FDCWD, "/var/lib/filebeat/registry.new", O_RDWR|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC, 0600) = 5
[pid  6558] write(5, "[{\"source\":\"/var/lib/jenkins/job"..., 22832420) = 22832420
[pid  6558] close(5)                    = 0
[pid  6558] lstat("/var/lib/filebeat/registry", {st_mode=S_IFREG|0600, st_size=22832408, ...}) = 0
[pid  6558] renameat(AT_FDCWD, "/var/lib/filebeat/registry.new", AT_FDCWD, "/var/lib/filebeat/registry") = 0
[pid  6558] openat(AT_FDCWD, "/var/lib/filebeat", O_RDONLY|O_CLOEXEC) = 5
[pid  6558] close(5)                    = 0
[pid  6558] openat(AT_FDCWD, "/var/lib/filebeat/registry.new", O_RDWR|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC, 0600) = 5
[pid  6558] write(5, "[{\"source\":\"/var/lib/jenkins/job"..., 22832433) = 22832433
[pid  6558] close(5)                    = 0
[pid  6558] lstat("/var/lib/filebeat/registry", {st_mode=S_IFREG|0600, st_size=22832420, ...}) = 0
[pid  6558] renameat(AT_FDCWD, "/var/lib/filebeat/registry.new", AT_FDCWD, "/var/lib/filebeat/registry") = 0
[pid  6558] openat(AT_FDCWD, "/var/lib/filebeat", O_RDONLY|O_CLOEXEC) = 5
[pid  6558] close(5)                    = 0
[pid  6535] openat(AT_FDCWD, "/var/lib/filebeat/registry.new", O_RDWR|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC, 0600) = 5
[pid  6534] write(5, "[{\"source\":\"/var/lib/jenkins/job"..., 22832445) = 22832445
[pid  6534] close(5)                    = 0

It seems like it's hanging to write the new registry file. During this time, I have write speed of about 100MB/s. The old registry file is about 23MB and the new registry file has 0 bytes.

For the record:

Adding filebeat.registry_flush: 1s to the top level config in filebeat.yml reduces the time from multiple hours to about 3 seconds.

I'm a bit underwhelmed by the lack of responses here.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.