Filebeat: Exiting: Can only start an input when all related states are finished

Hello,
my filebeat 6.2.4 had been running for quite a long. Yesterday I had to restart it and it turned out it is bouncing every couple of seconds with the following message:

Sep 07 09:42:02 jira filebeat[93968]: Exiting: Can only start an input when all related states are finished: {Id:2274-2049 Finished:false Fileinfo:0xc420508ea0 Source:/var/log/syslog Offset:330285 Timestamp:2018-09-07 09:41:59.952889366 +0000 UTC m=+2.074304955 TTL:-1ns Type:log Meta:map[] FileStateOS:2274-2049}
Sep 07 09:42:02 jira systemd[1]: filebeat.service: Main process exited, code=exited, status=1/FAILURE
Sep 07 09:42:02 jira systemd[1]: Stopped Filebeat sends log files to Logstash or directly to Elasticsearch..

Then I have upgraded filebeat to 6.3 and then to 6.4 but it didn't help. Here is my config file:

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  reload.period: 60s

filebeat.prospectors:
- type: log
  paths:
    - /var/log/*
  fields:
    type: system
  exclude_files: ['\.gz$', 'btmp*', 'wtmp*']
  close_inactive: 1h

- type: log
  paths:
    - /opt/app1/logs/*.log
  fields:
    type: app1
  exclude_files: ['\.gz$']
  close_inactive: 5h
  multiline.pattern: '^[[:space:]]'
  multiline.negate: false
  multiline.match: after

- type: log
  paths:
    - /opt/atlassian/jira/logs/*.log
  fields:
    type: jira
  exclude_files: ['\.gz$']
  close_inactive: 1h
  multiline.pattern: '^[[:space:]]'
  multiline.negate: false
  multiline.match: after

- type: log
  paths:
    - /opt/app1/submodule1/current/logs/*.log
  fields:
    type: submodule1
  exclude_files: ['\.gz$']
  close_inactive: 1h
  multiline.pattern: '^[[:space:]]'
  multiline.negate: false
  multiline.match: after

- type: log
  paths:
    - /opt/app1/submodule2/current/logs/*.log
  fields:
    type: submodule2
  exclude_files: ['\.gz$']
  close_inactive: 1h
  multiline.pattern: '^[[:space:]]'
  multiline.negate: false
  multiline.match: after

- type: log
  paths:
    - /opt/app1/submodule3/logs/*.log
  fields:
    type: submodule3
  exclude_files: ['\.gz$']
  close_inactive: 1h
  multiline.pattern: '^[[:space:]]'
  multiline.negate: false
  multiline.match: after

output.logstash:
  hosts: ["x.x.x.x:5043"]

I can see that it is failing on /var/log/syslog but have no idea why ...
Can anybody enlighten me please ?

It's due to a race condition in Filebeat during saving the state to the registry file while shutting down. The harvester which read /var/log/syslog was not cleaned up properly. This lead to a corrupt registry file.
You could work around the issue by editing the registry. Set Finished in entry of /var/log/syslog to true and Filebeat can start again. This could lead to resending the last batch of previously read messages from the file. So if that's a problem, don't do it.

Also, could you please open an issue on Github? It's a tricky bug to reproduce, but it needs to be fixed. https://github.com/elastic/beats/issues/new

2 Likes

Hi @kvch,

thanks for prompt reply !
Given the syslog entry in the registry:

  {
    "source": "/var/log/syslog",
    "offset": 330285,
    "timestamp": "2018-09-07T09:41:59.952889366Z",
    "ttl": -1,
    "type": "log",
    "meta": null,
    "FileStateOS": {
      "inode": 2274,
      "device": 2049
    }
  },

where should I put the Finished entry ?

Indeed. It's not serialized.
Could you please share the debug logs of Filebeat? (./filebeat -e -d "*")

But there isn't any Finished entry for any of these files in the registry.
Have just realized that it is also complaining about /var/log/auth.log file.
Is the syntax correct:

{
    "Finished": "true"
    "source": "/var/log/syslog",
    "offset": 330285,
    "timestamp": "2018-09-07T09:41:59.952889366Z",
    "ttl": -1,
    "type": "log",
    "meta": null,
    "FileStateOS": {
      "inode": 2274,
      "device": 2049
    }
  },

Yes, there is no such field, because it's not serialized. I would like to see the debug logs, so I can track back how your Filebeat instance works out if the state is finished or not. So I can suggest you a way to work around the problem without log duplication. :slight_smile:

The debug log is huge. What part of it do you want to see ?

It enough if you just start the Beat again run it for a minute (tops) and stop it afterwards. I am only interested in the part where Filebeat scans for files and reads the registry file at the beginning.

Here you are:

root@jira:/etc/filebeat# /usr/share/filebeat/bin/filebeat -e -d "*"
2018-09-07T12:23:51.421Z	INFO	instance/beat.go:544	Home path: [/usr/share/filebeat/bin] Config path: [/usr/share/filebeat/bin] Data path: [/usr/share/filebeat/bin/data] Logs path: [/usr/share/filebeat/bin/logs]
2018-09-07T12:23:51.421Z	DEBUG	[beat]	instance/beat.go:571	Beat metadata path: /usr/share/filebeat/bin/data/meta.json
2018-09-07T12:23:51.449Z	INFO	instance/beat.go:551	Beat UUID: d3bdd3ba-5b90-468c-80fa-79f3857968bb
2018-09-07T12:23:51.449Z	DEBUG	[seccomp]	seccomp/seccomp.go:109	Loading syscall filter	{"seccomp_filter": {"no_new_privs":true,"flag":"tsync","policy":{"default_action":"errno","syscalls":[{"names":["accept","accept4","access","arch_prctl","bind","brk","clock_gettime","clone","close","connect","dup","dup2","epoll_create","epoll_create1","epoll_ctl","epoll_pwait","epoll_wait","exit","exit_group","fchdir","fchmod","fchown","fcntl","fdatasync","flock","fstat","fsync","ftruncate","futex","getcwd","getdents","getdents64","geteuid","getgid","getpeername","getpid","getppid","getrandom","getrusage","getsockname","getsockopt","gettid","gettimeofday","getuid","inotify_add_watch","inotify_init1","inotify_rm_watch","ioctl","kill","listen","lseek","lstat","madvise","mincore","mkdirat","mmap","mprotect","munmap","nanosleep","newfstatat","open","openat","pipe","pipe2","poll","pread64","pselect6","pwrite64","read","readlink","readlinkat","recvfrom","recvmmsg","recvmsg","rename","renameat","rt_sigaction","rt_sigprocmask","rt_sigreturn","sched_getaffinity","sched_yield","sendfile","sendmmsg","sendmsg","sendto","set_robust_list","setitimer","setsockopt","shutdown","sigaltstack","socket","stat","statfs","sysinfo","tgkill","time","tkill","uname","unlink","unlinkat","wait4","waitid","write","writev"],"action":"allow"}]}}}
2018-09-07T12:23:51.450Z	INFO	[seccomp]	seccomp/seccomp.go:116	Syscall filter successfully installed
2018-09-07T12:23:51.451Z	INFO	[beat]	instance/beat.go:768	Beat info	{"system_info": {"beat": {"path": {"config": "/usr/share/filebeat/bin", "data": "/usr/share/filebeat/bin/data", "home": "/usr/share/filebeat/bin", "logs": "/usr/share/filebeat/bin/logs"}, "type": "filebeat", "uuid": "d3bdd3ba-5b90-468c-80fa-79f3857968bb"}}}
2018-09-07T12:23:51.451Z	INFO	[beat]	instance/beat.go:777	Build info	{"system_info": {"build": {"commit": "34b4e2cc75fbbee5e7149f3916de72fb8892d070", "libbeat": "6.4.0", "time": "2018-08-17T22:20:20.000Z", "version": "6.4.0"}}}
2018-09-07T12:23:51.451Z	INFO	[beat]	instance/beat.go:780	Go runtime info	{"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":2,"version":"go1.10.3"}}}
2018-09-07T12:23:51.459Z	INFO	[beat]	instance/beat.go:784	Host info	{"system_info": {"host": {"architecture":"x86_64","boot_time":"2018-01-30T08:13:34Z","containerized":false,"hostname":"jira","ips":["127.0.0.1/8","::1/128","10.101.4.4/24","fe80::222:48ff:fe01:1989/64"],"kernel_version":"4.13.0-1007-azure","mac_addresses":["00:22:48:01:19:89"],"os":{"family":"debian","platform":"ubuntu","name":"Ubuntu","version":"16.04.3 LTS (Xenial Xerus)","major":16,"minor":4,"patch":3,"codename":"xenial"},"timezone":"UTC","timezone_offset_sec":0,"id":"3dec2bea2c5747c2b95d853f2a55dbfa"}}}
2018-09-07T12:23:51.460Z	INFO	[beat]	instance/beat.go:813	Process info	{"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"effective":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/etc/filebeat", "exe": "/usr/share/filebeat/bin/filebeat", "name": "filebeat", "pid": 7716, "ppid": 7174, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2018-09-07T12:23:51.300Z"}}}
2018-09-07T12:23:51.461Z	INFO	instance/beat.go:273	Setup Beat: filebeat; Version: 6.4.0
2018-09-07T12:23:51.461Z	DEBUG	[beat]	instance/beat.go:290	Initializing output plugins
2018-09-07T12:23:51.461Z	DEBUG	[processors]	processors/processor.go:66	Processors:
2018-09-07T12:23:51.462Z	DEBUG	[publish]	pipeline/consumer.go:137	start pipeline event consumer
2018-09-07T12:23:51.462Z	INFO	pipeline/module.go:98	Beat name: jira
2018-09-07T12:23:51.463Z	WARN	[cfgwarn]	beater/filebeat.go:78	DEPRECATED: prospectors are deprecated, Use `inputs` instead. Will be removed in version: 7.0.0
2018-09-07T12:23:51.469Z	ERROR	fileset/modules.go:118	Not loading modules. Module directory not found: /usr/share/filebeat/bin/module
2018-09-07T12:23:51.469Z	INFO	[monitoring]	log/log.go:114	Starting metrics logging every 30s
2018-09-07T12:23:51.469Z	INFO	instance/beat.go:367	filebeat start running.
2018-09-07T12:23:51.469Z	INFO	registrar/registrar.go:97	No registry file found under: /usr/share/filebeat/bin/data/registry. Creating a new registry file.
2018-09-07T12:23:51.469Z	DEBUG	[registrar]	registrar/registrar.go:400	Write registry file: /usr/share/filebeat/bin/data/registry
2018-09-07T12:23:51.527Z	DEBUG	[registrar]	registrar/registrar.go:393	Registry file updated. 0 states written.
2018-09-07T12:23:51.528Z	INFO	registrar/registrar.go:134	Loading registrar data from /usr/share/filebeat/bin/data/registry

I have just realized that it is not using /var/lib/filebeat/registry file ...

The rest of the logs (workaround for post size limit):

2018-09-07T12:23:51.528Z	INFO	registrar/registrar.go:141	States Loaded from registrar: 0
2018-09-07T12:23:51.528Z	WARN	beater/filebeat.go:371	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2018-09-07T12:23:51.528Z	INFO	crawler/crawler.go:72	Loading Inputs: 6
2018-09-07T12:23:51.528Z	DEBUG	[registrar]	registrar/registrar.go:267	Starting Registrar
2018-09-07T12:23:51.528Z	DEBUG	[processors]	processors/processor.go:66	Processors:
2018-09-07T12:23:51.528Z	DEBUG	[input]	log/config.go:201	recursive glob enabled
2018-09-07T12:23:51.528Z	DEBUG	[input]	log/input.go:147	exclude_files: [\.gz(?-m:$) btmp* wtmp*]. Number of stats: 0
2018-09-07T12:23:51.528Z	DEBUG	[input]	log/input.go:168	input with previous states loaded: 0
2018-09-07T12:23:51.528Z	INFO	log/input.go:138	Configured paths: [/var/log/*]
2018-09-07T12:23:51.528Z	INFO	input/input.go:114	Starting input of type: log; ID: 9391436317143837143
2018-09-07T12:23:51.529Z	DEBUG	[processors]	processors/processor.go:66	Processors:
2018-09-07T12:23:51.529Z	DEBUG	[input]	log/input.go:174	Start next scan
2018-09-07T12:23:51.529Z	DEBUG	[input]	log/config.go:201	recursive glob enabled
2018-09-07T12:23:51.529Z	DEBUG	[input]	log/input.go:255	Exclude file: /var/log/alternatives.log.2.gz
2018-09-07T12:23:51.529Z	DEBUG	[input]	log/input.go:255	Exclude file: /var/log/alternatives.log.3.gz

I'm getting the exact same error running 6.4 after a recent update from 6.2.4. Have you come across a solution yet?

@kvch,
I am not sure when someone will find and fix the bug. In the meantime I just need filebeat to be running even if it will re-ship some logs. Is there any way to keep it running ? I just want to sacrifice logs consistency and have the filebeat running.

The quickest way to get Filebeat running again is to delete the entry of /var/log/syslog along with all entries which Filebeat complains about. This could lead to log duplication.
However, you could rotate the logs files which cause the problem until the logs you know are not yet forwarded to the output.

@kvch,
thanks for prompt response :slight_smile:
Where do I delete the entry of /var/log/syslog from ? From filebeat registry ? Any tool to do that or just plain vi/vim ?

Yes, from the registry file. Just simply delete the lines using vim. Just make sure that the JSON remains syntactically correct.

@kvch,
I have just removed the syslog and auth entry from registry file. Then I ran it again and it was still complaining about auth. It turned out this time filebeat is using /var/lib/filebeat/registry as opposite to /usr/share/filebeat/bin/data/registry from the log above. Why is that ?!?!? What is causing this change ?!?!?

@kvch,
I have just removed all the auth* entries from /var/lib/filebeat/registry (filebeat was down) and after I kicked off filebeat it started complaining about ... auth.log and auth.log.1 again adding this entry to the registry. Any workaround ?

@kvch,
please help with it.

Delete the registry file Filebeat uses. This is a bit dramatic, but it seems Filebeat won't stop complaining.

I was able to solve the problem on my end. It happened because I essentially had two different harvesters trying to read the same causing the error.

In my case, I had configured the filebeat.config.modules to read from my modules.d directory, and I also had configured the same settings in my filebeat.yml file to read the same modules, which caused this error.

In this particular case for @przemolb is there any overlap between your prospectors and modules? Or, which would make more sense since we can't see your entire config file, do you have specific modules configured in your filebeat.yml settings that overlap with your modules.d directory?

5 Likes