I have been running Filebeat in a DaemonSet using the "autodiscover" feature for a couple of weeks, and all worked fine. But at some point on Friday the Filebeat instances started eating memory until they got killed by the OOM killer. I arrived back at work this morning to discover that one of the pods had been restarted 197 times over the weekend, rather than the desired 0.
This is not, sadly, the known issue to do with failing to open files, because there are no log messages about failures to open files.
I tried upgrading from 6.2.4 to 6.3.0, which has a fix for that issue, and the behaviour continued unchanged.
What would people like to see in the way of logs and/or configuration?
The only slightly suspicious-looking things in the logs are DNS lookup timeout messages (of more than one form). I don't know whether I was getting these before the OOM failures started.
I'm sorry to read that you are having problems with Filebeat. For this case it would be helpful to have a memory profile of your Filebeat; you can obtain one with the --memprofile flag or with --httpprof. Could you get one and share it with us?
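For example, run by hand it would look something like the following; the config path and the 0.0.0.0:6060 address are only illustrative, any free port will do:

    # Write a heap profile to a file when Filebeat shuts down
    filebeat -e -c filebeat.yml --memprofile /tmp/filebeat.mprof

    # Or serve the Go pprof HTTP endpoints while Filebeat runs
    filebeat -e -c filebeat.yml --httpprof 0.0.0.0:6060

In a DaemonSet you would add the --httpprof argument to the container's args, binding it to 0.0.0.0 rather than localhost if you want to reach it from outside the pod.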
I've got it running with --httpprof. I've found somewhere that adding /debug/pprof gets you something other than a 404. But which links would you then like me to click on, and what would you like to see?
You can take a live profile from http://<host>:<port>/debug/pprof/heap. It'd be nice if you could send one to us when the memory increases so we can take a look. If you want, you can explore it directly with go tool pprof.
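Something along these lines should do it, with <host>:<port> being whatever address you gave to --httpprof (this is just a sketch; go tool pprof needs a Go toolchain wherever you run it):

    # Explore the live heap profile interactively
    go tool pprof http://<host>:<port>/debug/pprof/heap

    # Or just save the raw profile to a file that can be shared
    curl -s -o filebeat-heap.pprof http://<host>:<port>/debug/pprof/heap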
By the way, what memory limit did you set for these pods? Did you change the Filebeat version last week?
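If you are not sure, something like this should show what is currently applied (assuming the DaemonSet keeps the name and namespace used in the stock manifest, filebeat in kube-system; adjust if yours differ):

    # Print the resource requests/limits from the running DaemonSet
    kubectl -n kube-system get daemonset filebeat \
      -o jsonpath='{.spec.template.spec.containers[0].resources}'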
Profile output will follow shortly. The memory configuration is taken directly from the published Kubernetes deployment file (I just copied it without trying to understand it).
I installed Filebeat 6.2.4 a couple of weeks ago. As far as I know it was initially working fine.
At some point last Friday it started misbehaving, with pods restarted frequently due to being out of memory.
As I knew there was a memory-leak issue that was fixed in 6.3.0, I installed that on Friday to see if it would make any difference, and it appeared not to.
I'm still running 6.3.0, and it still appears to be misbehaving.
I don't really want to try increasing the limit at the moment because there's something really sick with our K8s cluster, and other people think that the Filebeat problems might be the cause, or at least a contributory factor, and they'd rather like me to stop running it for the time being.
We had other reasons to rebuild the K8s cluster. On restarting (and with Metricbeat now running as well), the Filebeat memory usage was flat for a couple of hours and then started increasing again; one pod restarted overnight.