I think I discovered a bug and wanted the community's opinion before I submit an issue on GitHub.
First, this only happens with forwarded events. I haven't been able to reproduce with locally generated events.
When starting winlogbeat, if the registry file does not exist, then it starts processing all events in the monitored event log from the beginning as would be expected.
If the registry file does exist, even if the timestamp/record_number saved is in the past by, say, several minutes, winlogbeat will not resume at that point. Rather, it starts processing events that are just coming in. Events that are between the timestamp saved in the registry file, and the moment that winlogbeat is started, are lost.
Can anybody else reproduce this? Or maybe I'm doing something wrong? I'm running Winlogbeat 5.0.2 on Windows 2012 R2.
The registry file stores the latest value from the record_number field. The record_number field is normally a one up counter. But that assumption is broken with ForwardedEvents because it mixes events from multiple logs and hosts. My theory is that since it stores the latest record_number it's possible that it could read an event with record_number=10 followed by record_number=5. Then on restart it reads from record_number=5 causing some duplication.
One workaround would be to use Logstash with fingerprint filter to create the _id field based on the event's @timestamp + computer_name + log_name + record_number. (@timestamp comes from the event record and not the time Winlogbeat read the event so it's safe to use in a fingerprint.)
I think what you are describing is a possibly related, but separate problem. I also observed this - Winlogbeat will lose some events during normal operation, if events from multiple hosts are mixed in the same event log "box", I noticed.
However, the problem that I am describing, I was able to reproduce it by using a single forwarded host - in this scenario, the record_number progression would be consistent across Winlogbeat restarts. Would you be willing to try and reproduce this?
I think this assumption is wrong. Can you check the data in event viewer? Each event log has its own record_number counter. So multiple logs (Application, Security) being forwarded from a single host are enough to cause the problem.
Fair point, to be honest I hadn't thought of that. However as it happens I'm forwarding only the Security event log from that particular host, so the record_numbers should be in sequence.
I think I know why this is happening. It's all caused by the same problem - the fact that we don't store an XML bookmark. When resuming it creates a bookmark based on the configured name of the log which is ForwardedEvents. Windows doesn't know what to do with that information since there are no records matching that log name.
Do you think this could also be causing events to be lost by Winlogbeat during normal operation (i.e. no restarts involved)? I'm seeing some discrepancies.
I opened https://github.com/elastic/beats/issues/3731 for this (Winlogbeat losing events during normal operation) and referenced this discussion as a possibly related.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.