We recently began having problems with runaway memory leaks in Filebeat (observed on versions 6.1.3 and above) on various versions of Scientific Linux 6.x and CentOS 7.x.
It appears to be associated with a permission-denied error when trying to read users' ~/.bash_history files from an NFS auto-mount volume. I saw many errors of this type in the filebeat log file on the affected machines:
2018-03-08T06:11:58.464-0800 ERROR log/prospector.go:459 Harvester could not be started on existing file: /home/norman/.bash_history, Err: Error setting up harvester: Harvester setup failed. Unexpected file opening error: Failed opening /home/norman/.bash_history: open /home/norman/.bash_history: permission denied
The filebeat.yml harvester config was:
#-------------------------- USER LOGIN SHELL HISTORY -------------------------
- type: log
  paths:
    - /root/.*history
    - /home/*/.*history
  tags: ["history"]
  tail_files: true
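In hindsight, the filebeat reference documents several close_* and clean_* options that bound how long a harvester holds on to a file and how long state is kept for files that disappear. We never tested this, but a sketch of a config that might have limited the damage on the auto-mounted paths (option names are from the filebeat docs; the values are guesses) would be:

```yaml
- type: log
  paths:
    - /home/*/.*history
  tags: ["history"]
  tail_files: true
  # Close each harvester after a fixed time, even if the file is still there.
  close_timeout: 5m
  # Drop registry state for files that vanish (e.g. when autofs unmounts).
  clean_removed: true
  # Cap the number of concurrently open harvesters for this prospector.
  harvester_limit: 16
```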
After disabling the /home/*/.*history path and restarting Filebeat, memory usage no longer appears to be climbing and the error messages in the filebeat log file stopped.
This is not the only path defined in my config that includes wildcards, but it IS the only path defined on an NFS auto-mount volume.
Filebeat is running as root, so it has no problem reading the shell history files when the home directory is actually mounted, but it doesn't handle the unmounting of a user's home directory properly and starts leaking RAM. It does not recover by itself, and eventually the affected machines start swapping.
I can't share the entire config file due to proprietary concerns about the nature of our logging, but I can confirm that the problem persists with Elasticsearch 6.2, Logstash 6.2.2, and Filebeat 6.2.2, and that it seems to happen only with NFS auto-mounted home directories.
We've decided not to try to log things this way in the future so this issue isn't uppermost on my list anymore.
I think you mean network-mounted volumes, not just mounted volumes, because not being able to use Beats on mounted volumes at all would kinda suck. But it does leak memory; it's 100% reproducible in our environment, regardless of OS (from various Scientific Linux 6.x releases through CentOS 7.4).
Do you have any suspicious errors in your log file related to file opening or closing? Which close options do you use? Do you see any files that filebeat keeps open which it shouldn't?
Unfortunately I had to remove the logging for the files in the NFS-mounted directories and no longer have any example data to look at. However, the filebeat config didn't have any special directives for closing or opening files, just whatever settings filebeat uses by default.
Since I have a functional workaround I don't need to proceed with this investigation any further for my own benefit.
@jeffkirk1 Thanks for getting back to me. I fully understand that there is not much benefit for you in investigating further, and I'm happy you have a working setup. I will keep an eye on this issue as it might pop up in other places.
Hi @ruflin, we are also experiencing this memory leak due to failure to setup harvesters (Filebeat 6.2.2). In our case they are local files, so this is not isolated to NFS-mounted directories.
From what I can tell, this comes down to a channel.SubOutlet being created for the new harvester but not cleaned up properly when h.Setup() returns an error. I think this is because the outlet is actually cleaned up by something that happens in harvester.Run(). Another possibility is that the new harvester is added to a state or registry structure somewhere and forgotten, so garbage collection can't reap it.
Here's some output from pprof. I would attach the pprof file itself, but the web interface here won't let me upload anything that isn't an image. If there's a way to send you that file, please let me know.
(pprof) top10
33159.58kB of 33159.58kB total ( 100%)
Dropped 308 nodes (cum <= 165.80kB)
Showing top 10 nodes out of 51 (cum >= 512.04kB)
flat flat% sum% cum cum%
19975.92kB 60.24% 60.24% 19975.92kB 60.24% runtime.malg
7680.66kB 23.16% 83.40% 7680.66kB 23.16% github.com/elastic/beats/filebeat/channel.SubOutlet
2048.19kB 6.18% 89.58% 2048.19kB 6.18% github.com/elastic/beats/filebeat/prospector/log.NewHarvester
1357.91kB 4.10% 93.68% 1357.91kB 4.10% runtime.allgadd
1024.08kB 3.09% 96.76% 1024.08kB 3.09% runtime.acquireSudog
544.67kB 1.64% 98.41% 544.67kB 1.64% github.com/elastic/beats/libbeat/publisher/queue/memqueue.NewBroker
528.17kB 1.59% 100% 528.17kB 1.59% regexp.(*bitState).reset
0 0% 100% 528.17kB 1.59% github.com/elastic/beats/filebeat/beater.(*Filebeat).Run
0 0% 100% 512.04kB 1.54% github.com/elastic/beats/filebeat/channel.CloseOnSignal.func1
0 0% 100% 512.04kB 1.54% github.com/elastic/beats/filebeat/channel.SubOutlet.func1
(pprof) list SubOutlet
Total: 32.38MB
ROUTINE ======================== github.com/elastic/beats/filebeat/channel.SubOutlet in /home/jeremy/src/go/src/github.com/elastic/beats/filebeat/channel/util.go
7.50MB 7.50MB (flat, cum) 23.16% of Total
. . 15:// SubOutlet create a sub-outlet, which can be closed individually, without closing the
. . 16:// underlying outlet.
. . 17:func SubOutlet(out Outleter) Outleter {
. . 18: s := &subOutlet{
. . 19: isOpen: atomic.MakeBool(true),
1MB 1MB 20: done: make(chan struct{}),
2MB 2MB 21: ch: make(chan *util.Data),
4.50MB 4.50MB 22: res: make(chan bool, 1),
. . 23: }
. . 24:
. . 25: go func() {
. . 26: for event := range s.ch {
. . 27: s.res <- out.OnEvent(event)
@jdonahue Thanks so much for all the details and the thorough investigation. There does indeed seem to be something potentially wrong with the SubOutlet. Could you open a Github issue and copy-paste this over? https://github.com/elastic/beats I'm hoping you can attach files there.
Could you also share your config and exact OS in the Github issue?