How to avoid duplicate data when switching from filebeat's log-input to filestream-input?

Hey guys,

right now, I'm planning the switch from filebeat's log-input to filestream-input.

We have many filebeats running on our numerous servers that are harvesting many log files. We deploy them with an Ansible Playbook.
When we deploy a new filebeat version, the playbook copies the data directory (containing the registry) from the current filebeat to the new one's directory. This ensures that the new filebeat starts harvesting where the old one stopped, so we neither lose data nor ingest duplicates into our cluster.

Now, after the switch, the filestream inputs will ignore all existing registry entries, because those were written by the log input and there are no entries for filestream inputs yet. As a consequence, every filebeat will start harvesting each log file from the beginning, which would put a massive number of duplicate documents into our Elasticsearch cluster.

Is there any handy way to avoid this without much hassle? How do you think the switch from log-inputs to filestream-inputs should be done in order to avoid this?

Thanks in advance.

EDIT:
I don't think the ignore_older setting is the key to avoiding this issue, because we have log files that do not rotate (I know this is anything but ideal, but we can't change it at the moment).
But even with log rotation, there is no setting that lets you make filebeat start harvesting from an exact offset/entry.

My initial thought is to point filestream at a different directory and then have your application write its logs there. You would keep the old log input pointed at the old directory until it finishes reading the remainder of the "old" logs. Then you can remove it, and you'll only have the filestream input going forward.
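
A rough sketch of what that two-input setup could look like. The paths and the filestream `id` are made up for illustration, so adjust them to your own layout:

```yaml
filebeat.inputs:
  # Existing log input: keeps its registry state and drains the old directory.
  - type: log
    paths:
      - /var/log/myapp-old/*.log   # hypothetical old location

  # New filestream input: only sees files in the new directory,
  # so nothing that was already shipped gets read again.
  - type: filestream
    id: myapp-filestream           # hypothetical; each filestream input should have a unique id
    paths:
      - /var/log/myapp-new/*.log   # hypothetical new location
```

Once the old directory stops receiving data and the log input has caught up, the first entry can be dropped.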

Thanks Alex for your reply.

I get the point of your idea, but that would be a real hassle... we'd have to re-deploy all of our applications just for that.

I don't think you can avoid this. One of the improvements in the filestream input is how it writes the registry, so a registry written by the log input would not work for the filestream input.

These are two changes related to the registry:

Only the most recent updates are serialized to the registry. In contrast, the log input has to serialize the complete registry on each ACK from the outputs. This makes the registry updates much quicker with this input.

The input ensures that only offset updates are written to the registry's append-only log. The log input writes the complete file state.

So, this makes clear that the format of the registry file used by the filestream input is different from the registry file used by the log input.

We recently merged a feature to allow for an "automated" migration of log input to filestream input: https://github.com/elastic/beats/pull/34292. You still need to manually migrate the configuration from log input to filestream input, but the state is migrated automatically once Filebeat starts. A backup of the registry is also made, so you can revert everything to the previous state.

The PR contains a detailed example of how to configure the migration as well as all the documentation.

It should be available in the next release.
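
For reference, a minimal sketch of what the migrated input might look like. It assumes the option is named `take_over`, as described in the PR; the `id` and the path are hypothetical, and the exact syntax should be checked against the documentation of the release you end up on:

```yaml
filebeat.inputs:
  - type: filestream
    id: myapp-logs             # hypothetical; should be unique per filestream input
    take_over: true            # assumed option name from the PR: adopt the log input's state for matching files
    paths:
      - /var/log/myapp/*.log   # hypothetical; the same files the old log input was reading
```

The idea is that this entry replaces the old `log` input for the same paths rather than running alongside it.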

@TiagoQueiroz this is exactly what I was searching for - amazing! Thanks a lot!

Since the log-input hasn't been removed yet, we'll do the stack upgrade without migrating to filestream-input and wait for the next release containing this feature.

The log input won't be removed in the 8.x series; that would be a breaking change. So you don't need to worry about it disappearing on an upgrade ;)
