Filebeat Registry - Will I get duplicates if I delete it?


I've been an idiot. I added a conditional into my Logstash pipelines which, long story short, has caused me to miss around an hour of logging data over about 18 different agents.

I've since fixed the issue and data is now being indexed correctly. I've read that if I delete the filebeat registry this will cause Filebeat to re-send the log files.

I have two concerns with this:

  1. Will this "duplicate" any events in Elasticsearch that were indexed correctly?
  2. The if statement only broke certain modules, so some log types were still indexed correctly. If I delete the registry, would those files then be duplicated?

Duplicated data is probably even worse than the missing data, so that is not an option. Even though this was only an hour, we're probably talking 100,000+ missing events.

What is the right approach here to "backfill" this data?


The registry essentially keeps track of the files that Filebeat is reading or has read. For each file's entry in the registry, Filebeat records how far into that file it has already read, as a byte offset. This is a simplification, but it's essentially how Filebeat works.
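For illustration, a registry entry looks roughly like the sketch below. The exact format and location vary by Filebeat version (older versions keep a single JSON `registry` file; newer versions use a directory of log-structured files under `data/registry/`), and the paths and numbers here are made up:

```json
[
  {
    "source": "/var/log/app/app.log",
    "offset": 53184,
    "timestamp": "2023-04-01T10:15:00.000Z",
    "ttl": -1,
    "type": "log",
    "FileStateOS": { "inode": 279552, "device": 2049 }
  }
]
```

The `offset` field is the byte position Filebeat will resume reading from; deleting the entry resets that position to the start of the file.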

So, yes, deleting the registry means Filebeat won't think it has read any files yet, and it will start from scratch. That also means that, yes, you will end up with duplicates (unless you did some special configuration the first time around).
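One example of that kind of "special configuration" is deriving the Elasticsearch document `_id` from the event content, so re-sending the same line overwrites rather than duplicates. A hedged sketch using Logstash's fingerprint filter (field names and hosts are placeholders):

```
filter {
  fingerprint {
    # Hash the raw log line into a deterministic ID.
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    # Same line -> same _id -> re-ingesting updates instead of duplicating.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

Note this only helps if it was in place before the original ingest; adding it now won't deduplicate documents that were already indexed with auto-generated IDs.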

You said you've missed about an hour of data. Could you figure out which log files correspond to that hour (not just the entries you missed, but all entries over that hour), temporarily move them out of the logs folder, and then move them back? Filebeat will treat them as new log files and re-process them. Of course, to avoid duplicates, before you do the moving you'd want to delete the documents in Elasticsearch corresponding to that same hour, perhaps using the Delete By Query API.
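The delete step might look something like the sketch below. The index pattern and the time window are placeholders to adjust for your data, and it's worth running a `_count` with the same query first to sanity-check what would be removed:

```shell
# Placeholders -- point these at your cluster, indices, and "bad" hour.
ES_URL="http://localhost:9200"
INDEX="filebeat-*"
QUERY='{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-04-01T09:00:00Z",
        "lt":  "2023-04-01T10:00:00Z"
      }
    }
  }
}'

# 1. Preview how many documents the query matches:
#   curl -s "$ES_URL/$INDEX/_count" -H 'Content-Type: application/json' -d "$QUERY"
# 2. Then actually delete them:
#   curl -s -X POST "$ES_URL/$INDEX/_delete_by_query" -H 'Content-Type: application/json' -d "$QUERY"
```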

In short, rather than take the "hammer" approach of deleting the entire registry, perhaps you could take more of a "scalpel" approach to deleting and then re-ingesting all the logging data from the "bad" hour?


@shaunak - Thank you for the reply. For anyone reading this in the future with a similar situation: I basically identified the last log written before the gap and the first log written once I resolved the Logstash config. I then went into the log files, did some CLI magic (sed, grep, etc.) to pull out the missing log lines, wrote them to a new file, and re-ingested those via Filebeat.
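That "CLI magic" step might look something like this. The timestamps, paths, and log format below are made up for the demo; the idea is just to cut out the span between the last good line and the first line after the fix, then point a one-off Filebeat input at the resulting file:

```shell
# Build a tiny demo log so the extraction can be seen end to end.
cat > /tmp/demo.log <<'EOF'
2023-04-01T08:59:59 ok before
2023-04-01T09:00:12 first missing line
2023-04-01T09:30:00 another missing line
2023-04-01T10:01:03 first line after fix
2023-04-01T10:02:00 ok after
EOF

# Print everything from the first missing line through the first line
# written after the fix (inclusive range match on the timestamps).
sed -n '/2023-04-01T09:00:12/,/2023-04-01T10:01:03/p' /tmp/demo.log \
  > /tmp/missing-lines.log
```

From there, `/tmp/missing-lines.log` can be re-ingested with a temporary Filebeat input pointed at that path.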

As you said, the "scalpel" approach is much better, if a little more time-consuming :slight_smile:



This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.