I am using filebeat 6.21. In some cases the include pattern does not match any line in the log file, and yet the registry file is still being updated frequently.
By looking into the code I came across
filebeat.registry_flush
This setting is not mentioned in the documentation, so I wonder whether it is safe to use or is only experimental for now.
I have set it to 100s and it looks fine to me, but I may have missed some test cases where this setting could leave filebeat in an unrecoverable state.
The registry_flush setting dictates when and how often the registry is serialized to disk. All state changes are buffered in main memory until the flush happens. The way the registry works right now, it keeps all state in memory, and a registry file update is basically a snapshot of the current state. With registry_flush: 0 (the default), each ACKed batch of events triggers a snapshot.
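To make the snapshot behaviour concrete, here is a minimal toy model in Python. It is not filebeat's code: `RegistryModel`, `simulate`, the log path, and the one-ACK-per-second workload are all hypothetical, and the model only checks the interval when an ACK arrives, which is a simplification.

```python
import json


class RegistryModel:
    """Toy model of the flush behaviour described above (not filebeat's code)."""

    def __init__(self, flush_interval_s):
        self.flush_interval_s = flush_interval_s  # 0 models the default
        self.state = {}            # in-memory state: path -> last ACKed offset
        self.snapshot = "{}"       # stands in for the registry file on disk
        self.snapshots_written = 0
        self.last_flush_s = 0.0

    def on_ack(self, path, offset, now_s):
        """Record an ACKed batch; snapshot if the flush interval has elapsed."""
        self.state[path] = offset
        elapsed = now_s - self.last_flush_s
        if self.flush_interval_s == 0 or elapsed >= self.flush_interval_s:
            self.flush(now_s)

    def flush(self, now_s):
        """Serialize the whole in-memory state, like a registry snapshot."""
        self.snapshot = json.dumps(self.state, sort_keys=True)
        self.snapshots_written += 1
        self.last_flush_s = now_s


def simulate(flush_interval_s, acks=300):
    """Feed one ACK per second for `acks` seconds; return snapshots written."""
    reg = RegistryModel(flush_interval_s)
    for second in range(1, acks + 1):
        # Hypothetical path and offsets, for illustration only.
        reg.on_ack("/var/log/app.log", second * 1000, float(second))
    return reg.snapshots_written
```

Under these assumptions, `simulate(0)` writes 300 snapshots while `simulate(100)` writes 3, which is the disk IO saving the setting buys. Between flushes the snapshot lags behind the in-memory state, and that gap is where duplicates come from after a crash.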
State updates include file renames and the offsets of the last sent events. If the state has not been flushed yet and filebeat is restarted, filebeat will have to send already published events again. Filebeat flushes the registry on normal shutdown, but if the machine or filebeat crashes, or if filebeat is forced to shut down, the final registry flush is missing. This leads to duplicates. As some events can still be in the pipeline (not yet ACKed), also use shutdown_timeout to reduce the chance of duplicates.
There is no 'perfect' value for registry_flush. It's a trade-off between the chance of duplicates on crashes and overall disk IO, and it's a 'risk' you will have to take as a user. The number of duplicates you might experience depends on the event rate and on registry_flush. A rough estimate is avg eps * registry_flush, where avg eps is the average number of events per second.
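As a quick worked example of that estimate, here is a small Python sketch. The function name `estimated_duplicates` and the rate of 50 events per second are assumptions for illustration; only the 100s interval comes from the question.

```python
def estimated_duplicates(avg_eps, registry_flush_s):
    """Rough estimate of events re-sent after a crash: avg eps * registry_flush.

    avg_eps          -- average events per second (assumed, measure your own)
    registry_flush_s -- the registry_flush interval in seconds
    """
    if avg_eps < 0 or registry_flush_s < 0:
        raise ValueError("rate and interval must be non-negative")
    return avg_eps * registry_flush_s


if __name__ == "__main__":
    # 50 eps is a made-up rate; 100 s is the interval from the question.
    print(estimated_duplicates(50, 100))
```

With those numbers the estimate is 5000 events re-sent in the worst case. The formula does not count events that were still in the pipeline and not yet ACKed at the time of the crash, so treat it as a ballpark figure only.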
After a missed flush, filebeat restarts with some old state. On startup, however, filebeat resyncs the in-memory registry state, so it simply continues processing from the 'old' offsets.
The setting not being documented is a bug. Please open an issue here. Thanks.