We are looking to use the o365 module from Filebeat to gather logs from the Office 365 API, and we have one question that is not addressed in the documentation (or I haven't found any mention of it).
To deal with load balancing and/or availability of the collection systems, we plan to deploy multiple Filebeat instances. They will run in a Kubernetes cluster using the official Filebeat Docker image.
The question is: how will those multiple Filebeat instances coordinate so they don't gather the same data multiple times? If I understand correctly, tracking is based only on the registry file of each Filebeat instance. Is there a way to share a common state between them?
I don't think Filebeat supports this. Since you're planning on running on Kubernetes, what you probably want to do is:
Run 1 container/pod
Ensure your Filebeat config is either handled via an environment variable or mounted as a ConfigMap.
Have a persistent network volume (NFS, iSCSI, etc.) mounted into the container, where Filebeat stores its registry. (I believe Filebeat stores its O365 collection state there, though I might be incorrect; I haven't actually used this module yet.)
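A minimal sketch of that setup, assuming the official image's default data path of `/usr/share/filebeat/data` for the registry (the resource names, image tag, and PVC below are hypothetical placeholders):

```yaml
# Single-replica Deployment so exactly one Filebeat instance polls the O365 API.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: filebeat-o365
spec:
  replicas: 1            # one collector only; no cross-instance coordination needed
  strategy:
    type: Recreate       # never run two pods against the same registry during a rollout
  selector:
    matchLabels:
      app: filebeat-o365
  template:
    metadata:
      labels:
        app: filebeat-o365
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:8.13.0   # pick your version
          volumeMounts:
            - name: config
              mountPath: /usr/share/filebeat/filebeat.yml
              subPath: filebeat.yml
            - name: registry
              mountPath: /usr/share/filebeat/data   # registry is kept here
      volumes:
        - name: config
          configMap:
            name: filebeat-o365-config
        - name: registry
          persistentVolumeClaim:
            claimName: filebeat-registry   # backed by NFS, iSCSI, etc.
```

The `Recreate` strategy matters: the default `RollingUpdate` would briefly run two pods writing to the same registry.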
This setup is probably the closest you can get to an HA setup. In theory, if everything is set up correctly, Filebeat will only go down for a few seconds as it moves to another Kubernetes node, then pick up where it left off by reading the mounted registry file.
(Edit)
The other option, which I don't recommend as it can cause performance issues, is to do what the Okta module does: copy o365.audit.Id to the document _id, so duplicates can't be indexed.
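One way to do that in Filebeat is the fingerprint processor, which can hash a field into `@metadata._id` so Elasticsearch uses it as the document _id. A sketch, assuming `o365.audit.Id` is the field name the o365 module produces (I haven't verified it):

```yaml
processors:
  - fingerprint:
      fields: ["o365.audit.Id"]
      target_field: "@metadata._id"
      method: "sha256"
```

With an explicit _id, re-ingesting the same event overwrites the existing document instead of creating a duplicate, at the cost of slower indexing (which is the performance issue mentioned above).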