Error when reloading prospectors for filebeat

During live reloading, a prospector is re-created as part of the check for whether a runner should be stopped or started. This re-creation causes running prospectors to shut down, because an error is thrown when the factory tries to load file state.

This line:

Calls this function:
https://github.com/elastic/beats/blob/master/filebeat/prospector/factory.go#L27

The factory function tries to call LoadStates on the prospector, and if there is already a prospector running for the files in question, the factory returns an error. Specifically, this error:
https://github.com/elastic/beats/blob/master/filebeat/prospector/prospector_log.go#L55

This causes the reload loop to continue to the next iteration, which then causes the currently running prospector to be incorrectly shut down once the loop is exited.

For reference:
I am currently running Filebeat 5.3 on CentOS 6.9.

As far as I understand you have the following setup:

  1. Prospector 1 running
  2. Prospector 2 config is moved into reload dir, prospector 1 is removed
  3. Reloading happens
  4. Prospector 2 does not start because of Finished, Prospector 1 stops

That is currently the intended behaviour: the next time the reload happens, it will start Prospector 2. Filebeat must guarantee that no two prospectors with the same harvesters are running at the same time, otherwise you will get duplicate events and probably lots of strange side effects.

It's actually something more like: Prospector 1 and 2 are running and harvesting different files, either one is removed, and then both are stopped indefinitely. I don't think I understand what you mean by "the next time the reload happens", since a reload only happens when there's an update in the prospector config folder. This means the remaining prospector, which should still be running, could stay stopped until another update is made, which doesn't seem correct.

I believe adding a 3rd prospector with a different harvester would actually cause 1 and 2 to stop as well.

That actually sounds like a bug. If Prospector 2 cannot be started, it should not be added to the registry that keeps track of which prospectors are already running, meaning it should be started during the next scan. There are potentially two issues here:

  • If a prospector is not started properly, it is still added to the registry, but it shouldn't be.
  • If a prospector is not started properly, it will not be started again, because the scan for new files will report that no files were updated.

Could you open a Github issue for this?

Looking at the code again, I think point 1 should not apply. If there is an error during loading the state, it will not be added to the registry. Point 2 still applies. But that means if you add a Prospector 3 and Prospector 1 was stopped in the meantime, Prospector 2 and 3 should be started.

The issue isn't with the first time a prospector gets started; that logic is fine. The issue is with how the code determines which currently active prospectors/runners should continue to run after a configuration update. The logic assumes that if a prospector is already running and is unchanged in an update, the new prospector's hash/ID will match an element in the registry, and the active prospector will therefore be removed from the stop list. Unfortunately, this does not fully work.

The main flaw is in how the factory attempts to load file states before checking whether an active prospector matches a prospector defined in the updated configuration. Loading the file states throws an error, which makes the reload loop execute a continue, and the prospector that should remain running after the reload is stopped, since it is never removed from the stop list.

Ok, I see your point. Sorry that I missed that before. I need to check why our tests do not cover this. We probably have to change where the loading of the state happens. One potential quick fix could be to check, inside the error handling in reload, whether the prospector ID already exists, and if so, remove it from the delete list.

Could you open a Github issue with that?

I opened https://github.com/elastic/beats/pull/4128 with a potential fix. I still need to add tests and check it in some more details.

Thanks! Here's the github issue: https://github.com/elastic/beats/issues/4133
