Filebeat Registry - Will I get duplicates if I delete

stevesimpson · April 21, 2020, 1:49pm

Hi,

I've been an idiot. I added a conditional to my Logstash pipelines which, long story short, has caused me to miss around an hour of logging data across about 18 different agents.

I've since fixed the issue and data is now being indexed correctly. I've read that if I delete the filebeat registry this will cause Filebeat to re-send the log files.

I have two concerns with this:

Will this "duplicate" any events in Elasticsearch that were indexed correctly?
The if statement only broke certain modules, so some log types were still indexed correctly. If I delete the registry, would those files then be duplicated?

Duplicated data is probably even worse than the missing data, so that is not an option. Even though this was only an hour, we're probably talking 100,000+ missing events.

What is the right approach here to "backfill" this data?

Thanks

shaunak · April 21, 2020, 5:13pm

The registry essentially keeps track of files that Filebeat is reading or has read. For each file's entry in the registry, Filebeat keeps track of how far (as a byte offset) into that file it has already read. This is a simplification, but it is essentially what Filebeat does.

So, yes, deleting the registry will mean that Filebeat doesn't think it's read any files yet, so it will start from scratch. That also means that, yes, you will end up with duplicates (unless you did some special configuration the first time around).
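To make the offset idea concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Filebeat's actual code or on-disk registry format; the `harvest` function and the log path are made up for illustration.

```python
# Toy model only: NOT Filebeat's real registry format or implementation.
def harvest(file_contents, registry):
    """Return the unread bytes of each file and advance the stored offsets."""
    shipped = {}
    for path, data in file_contents.items():
        offset = registry.get(path, 0)  # unknown file => start at byte 0
        if offset < len(data):
            shipped[path] = data[offset:]
        registry[path] = len(data)  # remember how far we have read
    return shipped


logs = {"/var/log/app.log": b"line1\nline2\n"}  # hypothetical log file
registry = {}

first = harvest(logs, registry)   # everything is shipped once
second = harvest(logs, registry)  # offsets are known, nothing is shipped
registry.clear()                  # "delete the registry"
third = harvest(logs, registry)   # everything is shipped again
```

In this model `third` contains the same bytes as `first`, which is exactly the duplicate problem: with no offsets left, every file looks unread.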

You said you've missed about an hour of data. Would it be possible for you to figure out which log files correspond to that hour (not just for entries you've missed but for all entries over that hour), temporarily move them out of the logs folder, and then move them back? Filebeat will think these are new log files and re-process them. Of course to avoid duplicates, before you do the moving you'd want to delete the documents in Elasticsearch corresponding to the same hour, perhaps using the Delete By Query API.
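As a rough sketch of the delete step, the request body for Delete By Query could be built like this. The `@timestamp` field, the time window, and the helper name are assumptions for illustration; check them against your own index mapping before deleting anything.

```python
import json


def build_delete_by_query(start, end, field="@timestamp"):
    """Build a body matching documents with start <= field < end.

    The field name and the time values are assumptions; adjust to your data.
    """
    return {"query": {"range": {field: {"gte": start, "lt": end}}}}


# Hypothetical "bad" hour; replace with the real start and end of the gap.
body = build_delete_by_query("2020-04-21T12:00:00Z", "2020-04-21T13:00:00Z")

# Intended to be sent as: POST /<your-index>/_delete_by_query
print(json.dumps(body, indent=2))
```

It is probably worth sending the same body to the `_count` endpoint first, to see how many documents would be removed before actually running the delete.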

In short, rather than take the "hammer" approach of deleting the entire registry, perhaps you could take more of a "scalpel" approach to deleting and then re-ingesting all the logging data from the "bad" hour?

stevesimpson · April 23, 2020, 8:41am

@shaunak - Thank you for the reply. For anyone reading this in the future with a similar situation, I basically identified the last logs written before the gap and the first log written once I resolved the Logstash config. I then went into the log files and did some CLI magic (sed, grep, etc.) to pull out the missing log lines, wrote them to a new file, and re-ingested those via Filebeat.
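For anyone who would rather script the extraction than use sed and grep, a minimal Python sketch of the same idea follows. It assumes every log line starts with an ISO-8601 timestamp, which will not be true for every log format; the function name, sample lines, and window are made up.

```python
from datetime import datetime


def extract_gap_lines(lines, start, end):
    """Keep lines whose leading timestamp falls in [start, end)."""
    kept = []
    for line in lines:
        parts = line.split(maxsplit=1)
        if not parts:
            continue  # blank line
        try:
            ts = datetime.fromisoformat(parts[0])
        except ValueError:
            # No parseable timestamp (e.g. a multiline continuation);
            # this sketch simply skips such lines.
            continue
        if start <= ts < end:
            kept.append(line)
    return kept


# Hypothetical sample data and gap window.
sample = [
    "2020-04-21T11:59:59 last event before the gap\n",
    "2020-04-21T12:30:00 event lost during the gap\n",
    "2020-04-21T13:00:00 first event after the fix\n",
]
gap = extract_gap_lines(
    sample,
    datetime(2020, 4, 21, 12, 0, 0),
    datetime(2020, 4, 21, 13, 0, 0),
)
```

The kept lines can then be written to a new file in a path that Filebeat watches, so they are picked up as a fresh log file.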

As you said, the "scalpel" approach is much better, if a little more time-consuming.

