Error when reloading prospectors for filebeat

grantwu · April 20, 2017, 3:38pm

In the live reloading path, a prospector is re-created in order to check whether an existing prospector should be stopped or started. The re-creation throws an error when it tries to load file state, which causes running prospectors to shut down.

This line:

https://github.com/elastic/beats/blob/master/libbeat/cfgfile/reload.go#L136


func (rl *Reloader) Run(runnerFactory RunnerFactory) {
	logp.Info("Config reloader started")

	rl.wg.Add(1)
	defer rl.wg.Done()

	// Stop all running modules when method finishes
	defer rl.stopRunners(rl.registry.CopyList())

	gw := NewGlobWatcher(rl.path)

	// If reloading is disable, config files should be loaded immediately
	if !rl.config.Reload.Enabled {
		rl.config.Reload.Period = 0
	}

	overwriteUpdate := true

	for {
		select {
		case <-rl.done:

Calls this function:
https://github.com/elastic/beats/blob/master/filebeat/prospector/factory.go#L27

The factory function tries to call LoadStates on the prospector and if there is already a prospector running for the files in question, then the factory will return an error. Specifically this error:
https://github.com/elastic/beats/blob/master/filebeat/prospector/prospector_log.go#L55

This causes the reload loop to hit its continue statement, which in turn causes the currently running prospector to be incorrectly shut down once the loop exits.

For reference:
I am currently running Filebeat 5.3 on CentOS 6.9.

ruflin · April 21, 2017, 11:02am

As far as I understand you have the following setup:

1. Prospector 1 running
2. Prospector 2 config is moved into reload dir, prospector 1 is removed
3. Reloading happens
4. Prospector 2 does not start because of Finished, Prospector 1 stops

That is currently the intended behaviour, as the next time the reload happens, it will start Prospector 2. Filebeat must guarantee that at no time are two prospectors with the same harvesters running, otherwise you will have duplicate events and probably lots of strange side effects.
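A minimal sketch of that guard, not the actual Filebeat code: the type fileState, the function loadStates, and the prefix matching are all hypothetical simplifications (a real prospector matches files by glob). It only illustrates the idea that a new prospector is refused while any state for its files is still unfinished.

```go
package main

import (
	"fmt"
	"strings"
)

// fileState is a simplified, hypothetical stand-in for a registry entry.
type fileState struct {
	Source   string
	Finished bool // false while a harvester still owns the file
}

// loadStates returns the states under prefix, or an error if any of them
// is still unfinished, meaning another prospector is harvesting the file.
func loadStates(states []fileState, prefix string) ([]fileState, error) {
	var owned []fileState
	for _, s := range states {
		if !strings.HasPrefix(s.Source, prefix) {
			continue // not one of this prospector's files
		}
		if !s.Finished {
			return nil, fmt.Errorf("state for %s is not finished", s.Source)
		}
		owned = append(owned, s)
	}
	return owned, nil
}

func main() {
	states := []fileState{
		{Source: "/var/log/app/a.log", Finished: false},
		{Source: "/var/log/sys/b.log", Finished: true},
	}

	// A file under /var/log/app/ is still being harvested, so this fails.
	_, err := loadStates(states, "/var/log/app/")
	fmt.Println(err)

	// Everything under /var/log/sys/ is finished, so this succeeds.
	owned, err := loadStates(states, "/var/log/sys/")
	fmt.Println(len(owned), err)
}
```

The guard protects against duplicate events, but it cannot tell a genuinely conflicting prospector from a re-creation of the one that is already running.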

grantwu · April 21, 2017, 4:20pm

It's actually more like this: Prospectors 1 and 2 are running and harvesting different files, either one is removed, and then both are stopped indefinitely. I don't think I understand what you mean by "the next time the reload happens", since a reload only happens when there's an update in the prospector config folder. This means the remaining prospector, which should still be running, could stay stopped until another update is made, which doesn't seem correct.

I believe adding a 3rd prospector with a different harvester would actually cause 1 and 2 to stop as well.

ruflin · April 24, 2017, 8:08am

That actually sounds like a bug. If Prospector 2 cannot be started, it should not be added to the registry that keeps track of which prospectors are already running. That means it should be started during the next scan. There are potentially two issues here:

1. If a prospector is not started properly, it is still added to the registry but shouldn't be.
2. If a prospector is not started properly, it will not be started again because the scan for new files will return that no files were updated.

Could you open a GitHub issue for this?

ruflin · April 24, 2017, 8:10am

Looking at the code again, I think point 1 should not apply. If there is an error while loading the state, the prospector will not be added to the registry. Point 2 still applies. But that means if you add a Prospector 3 and Prospector 1 was stopped in the meantime, Prospector 2 and 3 should be started.

grantwu · April 24, 2017, 5:54pm

The issue isn't around the first time a prospector gets started. That logic is fine. The issue is around how the code determines which currently active prospectors/runners should continue to run after a configuration update. The code tries to execute logic that implies if a prospector is already running and has been unchanged in an update, the new prospector's hash/ID will match an element in the registry and thus the active prospector will be removed from the stop list. This unfortunately does not fully work.

The main flaw is around how the factory attempts to load file states before checking if an active prospector matches a prospector defined in the updated configuration. The attempt to load file states will cause an error which means the reload loop will execute a continue and the prospector that should remain running after the reload is stopped since it is never removed from the stop list.
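To make the flow concrete, here is a small self-contained simulation, not the real reloader. The names newRunner, reload, and errStatesNotFinished are hypothetical, and prospector IDs are plain strings rather than config hashes. It shows how the early continue leaves an unchanged, active prospector on the stop list.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errStatesNotFinished stands in for the error returned when the files are
// still owned by a running prospector (hypothetical name).
var errStatesNotFinished = errors.New("related states are not finished")

// newRunner mimics a factory that loads file state during construction:
// it fails whenever a runner with the same ID is already active.
func newRunner(id string, running map[string]bool) error {
	if running[id] {
		return errStatesNotFinished
	}
	return nil
}

// reload returns the IDs that end up on the stop list after one pass.
func reload(running map[string]bool, configs []string) []string {
	stop := map[string]bool{}
	for id := range running {
		stop[id] = true // start by assuming every active runner is stale
	}
	for _, id := range configs {
		if err := newRunner(id, running); err != nil {
			continue // the ID comparison below is skipped
		}
		// Intended to keep unchanged runners alive, but an active runner
		// never reaches this point because newRunner has already failed.
		delete(stop, id)
	}
	ids := make([]string, 0, len(stop))
	for id := range stop {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	running := map[string]bool{"p1": true, "p2": true}
	// p1 was removed from the config directory, p2 is unchanged.
	fmt.Println(reload(running, []string{"p2"})) // prints [p1 p2]
}
```

Only p1 should be stopped here, yet p2 is stopped as well.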

ruflin · April 27, 2017, 12:24pm

OK, I see your point. Sorry that I missed that before. I need to check why our tests do not cover this. We probably have to change where the loading of the state happens. One potential quick fix could be to check, in the error path of the reload, whether the prospector ID already exists and, if so, remove it from the delete list.

Could you open a GitHub issue for that?
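A rough sketch of that quick fix, under the assumption that the prospector ID can be computed without going through the factory. The names newRunner and reloadWithFix are hypothetical, IDs are plain strings, and this is not the code from any actual patch.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errStatesNotFinished is a hypothetical stand-in for the state-loading error.
var errStatesNotFinished = errors.New("related states are not finished")

// newRunner fails when a runner with the same ID is already active,
// mimicking a factory that loads file state during construction.
func newRunner(id string, running map[string]bool) error {
	if running[id] {
		return errStatesNotFinished
	}
	return nil
}

// reloadWithFix returns the IDs to stop and the IDs to start.
func reloadWithFix(running map[string]bool, configs []string) (stop, start []string) {
	stopSet := map[string]bool{}
	for id := range running {
		stopSet[id] = true
	}
	for _, id := range configs {
		if err := newRunner(id, running); err != nil {
			if running[id] {
				// The ID already exists: the config is unchanged and the
				// runner is active, so take it off the stop list.
				delete(stopSet, id)
			}
			continue
		}
		start = append(start, id)
	}
	for id := range stopSet {
		stop = append(stop, id)
	}
	sort.Strings(stop)
	return stop, start
}

func main() {
	running := map[string]bool{"p1": true, "p2": true}
	// p1 removed, p2 unchanged, p3 newly added.
	stop, start := reloadWithFix(running, []string{"p2", "p3"})
	fmt.Println(stop, start) // prints [p1] [p3]
}
```

This keeps the state check in the factory and only changes how the reload loop reacts to its error, which is why it counts as a quick fix rather than moving where the state is loaded.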

ruflin · April 27, 2017, 1:20pm

I opened https://github.com/elastic/beats/pull/4128 with a potential fix. I still need to add tests and check it in more detail.

grantwu · April 27, 2017, 9:50pm

Thanks! Here's the GitHub issue: https://github.com/elastic/beats/issues/4133

