Intentionally delaying log harvesting at startup

I have a use case where I need to delay log harvesting. From what I've read, my best option would be scan_frequency, but I'm not sure whether that applies to the first start, or how Filebeat keeps track of this information (I suspect in memory).
The issue is that when I recreate a container from a snapshot, I can't stop Filebeat soon enough, and a couple thousand documents get added to my cluster. In a way that's impressively fast, but in this case I need to delay the start until I can remove the logs.
Can I achieve this by raising the default scan_frequency value (10s)? Is there any other way to do it without adding more services?
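For context, this is the setting I mean. A minimal filebeat.yml sketch; the path is just a placeholder for my actual log location:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log   # placeholder for the real log path
    # Default is 10s; I'd raise it to buy time to clean up the snapshot's logs.
    scan_frequency: 300s
```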

@YvorL scan_frequency only takes effect after the first scan, so I don't think it will help.

Are you using Filebeat's autodiscover to manage configuration as new containers appear and disappear?

No, I'm not using that. I'm running an instance of Filebeat inside the container itself. When a new container is created from the snapshot, all processes launch, including Filebeat, and unfortunately that happens before the logs are removed. You can think of it as restoring a backup from a snapshot. While the registry file contains the information about which logs have been processed, I believe that since the environment changed (e.g., the hostname), Filebeat will re-read the logs.

I wonder if the tail_files option could help you? https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html#_literal_tail_files_literal Be aware that it can also mean some events from new logs are skipped.
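Roughly like this (the path is a placeholder):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log   # placeholder path
    # Start reading new files at the end instead of the beginning, so
    # content that already exists when a file is picked up is skipped.
    tail_files: true
```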

Unfortunately, it seems it would cause more issues than it would solve. It would work if I could add the setting before the snapshot is taken and remove it immediately afterwards, but that's far from optimal and would slow down the process unnecessarily. :disappointed:
I'd need either an option that delays the start of scanning & harvesting on the first run, or one that tells Filebeat that an environment change (e.g., hostname) doesn't invalidate the record of already-processed logs.
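The closest thing I can picture is a crude wrapper that sleeps before exec'ing Filebeat. A hypothetical sketch (the command, config path, and delay are assumptions about my setup), and exactly the kind of extra moving part I'd like to avoid:

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: hold Filebeat back at container start so the
snapshot's stale logs can be removed first. The command, config path,
and delay below are assumptions, not anything Filebeat provides."""
import os
import time

STARTUP_DELAY_SECONDS = 60  # window for the cleanup job to delete old logs

time.sleep(STARTUP_DELAY_SECONDS)
# Replace this process with Filebeat so PID and signal handling stay normal.
os.execvp("filebeat", ["filebeat", "-c", "/etc/filebeat/filebeat.yml"])
```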

I can't think of a good workaround here at the moment :frowning: One thing we discussed in the past is having a unique id for each log line, so that if events are sent twice they would not be duplicated, but we are not there yet.
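For the curious, the rough idea is to derive a deterministic id from the event's origin and use it as the Elasticsearch document _id, so a re-sent event overwrites the existing document instead of duplicating it. A hypothetical sketch, not how Filebeat works today:

```python
import hashlib

def event_id(source_path: str, offset: int) -> str:
    """Deterministic id: the same line from the same file always hashes
    the same, so re-ingesting it updates the existing document rather
    than creating a duplicate."""
    return hashlib.sha256(f"{source_path}:{offset}".encode()).hexdigest()

# Example: this value would be passed as the document _id at index time.
print(event_id("/var/log/myapp/app.log", 4096))
```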

I see. Thank you for the info!
