Filebeat beta1 resends random data upon every restart (registry file not updated properly?)

Hi,

Filebeat Version: beta1
Logstash Version: alpha2
System: CentOS6

In my test scenario filebeat's processing around 1k files that are something around 40GB a day in total. The problem is that upon every restart of filebeat some (but not full) subset of files is send again for no apparent reason.

My initial idea here was that during filebeat shutdown some events were unacked by logstash and filebeat just shuts down too fast without waiting for them. But that seems to be impossible since number of events that are replied is few order of magnitude bigger than my original queue_size number.

All files are updated in last <24hrs. The only thing that Filebeat says about file that are rescanned is this upon every restart:

2016-09-29T12:16:02Z INFO Harvester started for file: /path/to/file

Prospector and other potentially related settings I have are:

  clean_inactive: 48h
  clean_removed: true
  close_inactive: 30m
  close_removed: true
  ignore_older: 24h
  tail_files: false

Ok, it seems there has to be bug in clean_removed logic.

Despite the above settings and the fact that file was available on disk for all of that time it looks like it's being occasionally removed from registry file:

# grep -r /path/to/problematic/file /var/lib/filebeat/registry
#

File was updated in last hour.

Changing clean_removed: to false prevents files from disappearing from registry.

Can you share some details on how you rotate the files? Do you use a network drive?

For reference: Here is the related Github issue: https://github.com/elastic/beats/issues/2652 Lets post all details here first and jump to the issue as soon as we nailed down the problem.

Hi @ruflin

  • I do not use network/shared drive in this case,
  • files are created with /logpath/$HOST-$YEAR-$MONTH-$DAY-$PROGRAM macro values in path
  • files are deleted/archieved the day after. after this they aren't in prospector path any more

Most importantly, resending data happens after filebeat restart on files within $CURRENT_DAY (so created/modified in last <24hrs) and before files are hitting ignore_older or clean_inactive.

Also as a proof that's not a case of moving/rotating this file anyhow :: "path" and "offset" metadata from Filebeat are the same.

All of this only happens when clean_removed is set to true.

For my understanding:

  • All your files have a unique path, not file rotation happens
  • Data is resent only on restart

When you were running the grep check above, was filebeat running or was it stopped? The thing I'm trying to figure out is if it only happens during shutdown or also if filebeat is running.

It would be very useful to have some log outputs from your filebeat instance as there it should show when a state was removed including the reason (age).

Your assumptions are correct.

However I see something very interesting now. I'm on a box now where registry file wasn't updated for several hours.

[root@host1 filebeat]# date
Mon Oct  3 16:21:38 UTC 2016

[root@host1 filebeat]# ls -l /var/lib/filebeat/registry
-rw-r----- 1 filebeat filebeat 1223255 Oct  3 02:04 /var/lib/filebeat/registry

And it also looks like the very last entry having either "regist" or "state" was also at this time.

[root@host1 filebeat]# grep states /var/log/filebeat/filebeat.log | tail -n 1
2016-10-03T02:04:53Z INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=2086 libbeat.logstash.published_and_acked_events=2293 libbeat.logstash.call_count.PublishEvents=10 libbeat.logstash.publish.read_bytes=378 libbeat.logstash.publish.write_bytes=1211622 libbeat.publisher.messages_in_worker_queues=9 registrar.writes=2 registrar.states.update=4274

There is nothing related to registry/states afterwards.

And I'd like to stress out that there is plenty of other events in filebeat log (like harvester ones) and Filebeat operates completely fine and it's also delivering events all the time without any issue.

[root@host1 filebeat]# ps -eo comm,lstart | grep filebeat
filebeat-god    Mon Oct  3 01:58:22 2016
filebeat        Mon Oct  3 01:58:22 2016

So the time was at follow
01:58 filebeat started
02:04 registry file was updated for last time
02:04 onwards :: delivering events normally

In other words on-disc registry file was not updated for 14hrs.

Restarting triggered entire data set to be rescanned. While stopping filebeat claimed that registry file was updated to current state. Unfortunately I did not copy it, so can check if it was updated compared to proper values. I believe I'll be able to reproduce this, so I'll post more details.

Judging from what I see in ES, all events after 02:04 were reimported again.

Ok, I actually believe those are separate issues as I have other box where file was removed from registry, but this time registry file was updated all the time.

2016-10-03T16:18:18Z Filebeat started 
2016-10-03T16:18:19Z INFO Harvester started for file: /problematic

[...]
Somewhere in between "/problematic" file disappeared from registry file, without any mention in logs.
Process was running all the time.
[...]

2016-10-03T17:21:28Z INFO filebeat stopped.
2016-10-03T17:21:43Z Filebeat got restarted
2016-10-03T17:21:43Z INFO Harvester started for file: /problematic

File was also re-added to registry

[root@host2 filebeat]# grep -l /problematic /var/lib/filebeat/registry
/var/lib/filebeat/registry

In less than 10 minutes file disappeared from registry. There is no mention of that operation in filebeat's logs.

[root@host2 filebeat]# grep -l /problematic /var/lib/filebeat/registry
[root@host2 filebeat]# 

This is completely repeatable.

There is one change in master which is not in the beta1 release: https://github.com/elastic/beats/pull/2552 In the past there were some undetected issue with writing the registry file. Even though filebeat kept running. This has changed now that filebeat checks on startup if it can write to the file. Could you check what happens if you run the nightly build of filebeat? https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat The nightly build should be fully compatible with the beta1 build, no changes to the config need.

I start to get the suspicion that there is an issue with writing the registry file. Before you start filebeat, is there no registry file or do you still have an old one around? In case none is around, will it be created by filebeat?

I start to get the suspicion that there is an issue with writing the registry file. Before you start filebeat, is there no registry file or do you still have an old one around? In case none is around, will it be created by filebeat?

There was a registry file. Also keep in mind that registry file was actually updated properly after restart, so I don't believe that's a permission/storage issue. It suddenly stopped being updated while filebeat was already running.

Also could please take a look at my 2nd post? As I believe that's different thing.
That file just disappeared from running registry for no apparent reason :frowning:

Sorry, somehow missed the update to this post. It is indeed a strange behaviour that something is removed from the registry without having something in the logs show up.

You mention that this is fully reproducible? Can you list the exact steps? So far I couldn't reproduce it but that is always tricky with things that happen over a longer time period. Looks like your test was running for an hour?

Full repro case was described in that comment Filebeat beta1 resends random data upon every restart (registry file not updated properly?)

In short it's enough to start filebeat, wait a little for and restart again.

I still have some boxes running with this configuration, please let me know if you need any additional information.

It would be very helpful if you could share the Filebeat log files with me. You can also share them via email if you prefer that.

@shog Interesting a bug was just discovered in the clean_removed part, but it was that states were not removed (kind of the opposite). As it is the same code that does all the logic I can't really see how the two could be related but I thought it is worth mentioning it: https://github.com/elastic/beats/issues/2646

This topic was automatically closed after 21 days. New replies are no longer allowed.