Filebeat memory leak associated with failure to open harvesters

Hi folks,

We recently began having problems with runaway memory leaks in Filebeat (observed on versions 6.1.3 and above) on various versions of Scientific Linux 6.x and CentOS 7.x.

It appears to be associated with a permission-denied error when trying to read user ~/.bash_history files from an NFS auto-mounted volume. I saw many errors of this type in the filebeat log file on the affected machines:

2018-03-08T06:11:58.464-0800 ERROR log/prospector.go:459 Harvester could not be started on existing file: /home/norman/.bash_history, Err: Error setting up harvester: Harvester setup failed. Unexpected file opening error: Failed opening /home/norman/.bash_history: open /home/norman/.bash_history: permission denied

The filebeat.yml harvester config was

#-------------------------- USER LOGIN SHELL HISTORY  -------------------------
- type: log
  paths:
    - /root/.*history
    - /home/*/.*history
  tags: ["history"]
  tail_files: true

After disabling the /home/*/.*history path and restarting Filebeat, memory usage no longer appears to be climbing, and the error messages in the filebeat log file have stopped.

This is not the only path defined in my config that includes wildcards, but it IS the only path defined on an NFS auto-mount volume.

Filebeat is running as root, so it has no trouble reading the shell history files while the home directory is actually mounted, but it doesn't handle the unmounting of the user's home directory properly and starts leaking RAM. It does not recover on its own, and eventually the affected machines start swapping.

Could you share your full config file? Any chance you could test this with the most recent 6.2 release?

BTW, thanks a lot for all the details so far. It's great to have this kind of detailed post; it should make finding the issue on our end easier.

Hi Ruflin,

I can't share the entire config file due to proprietary concerns about the nature of our logging, but I can confirm that the problem persists with Elasticsearch 6.2, Logstash 6.2.2, and Filebeat 6.2.2, and that it seems to happen only with NFS auto-mounted home directories.

We've decided not to log things this way going forward, so this issue isn't at the top of my list anymore.

Jeff

In general we recommend not using Filebeat on mounted volumes, as it can have side effects: https://www.elastic.co/guide/en/beats/filebeat/current/faq.html#filebeat-network-volumes I would still hope we don't leak memory, though.

I think you mean network-mounted volumes, not just mounted volumes, because not being able to use Beats on mounted volumes would kinda suck. :wink: But it does leak memory; it's 100% reproducible in our environment, regardless of OS (from various Scientific Linux 6.x distros through CentOS 7.4).

Haha, yes, definitely network-mounted volumes :slight_smile:

Do you have any suspicious errors in your log file related to files being opened or closed? Which close options do you use? Do you see any files that Filebeat keeps open which it shouldn't?

Unfortunately I had to remove the logging for the files in the NFS-mounted directories and no longer have any example data to look at. However, the filebeat config didn't have any special directives for closing or opening files, just whatever settings filebeat uses by default.
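For reference, spelling those defaults out explicitly on that prospector would look roughly like this. The values below are my reading of the 6.x log prospector defaults from the docs, not something copied from our actual config:

- type: log
  paths:
    - /home/*/.*history
  tags: ["history"]
  tail_files: true
  # Close/clean behaviour left at (what I understand to be) the 6.x defaults:
  close_inactive: 5m      # close the handle after 5 minutes with no new data
  close_renamed: false    # keep harvesting a file after it is renamed
  close_removed: true     # close the handle once the file is removed
  close_eof: false        # don't close as soon as EOF is reached
  close_timeout: 0        # never force-close a harvester after a fixed time
  clean_removed: true     # drop registry state for files that disappear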

Since I have a functional workaround, I don't need to pursue this investigation any further for my own benefit.

@jeffkirk1 Thanks for getting back to me. I fully understand that there's not much benefit for you in investigating further, and I'm glad you have a working setup. I will keep an eye on this issue as it might pop up in other places.

Hi @ruflin, we are also experiencing this memory leak due to failures to set up harvesters (Filebeat 6.2.2). In our case they are local files, so this is not isolated to NFS-mounted directories.

From what I can tell, this comes down to a channel.SubOutlet being created for each new harvester but not being cleaned up when h.Setup() returns an error; I think that's because the cleanup of the outlet normally happens as part of harvester.Run(). Another possibility is that the new harvester is getting added to a state or registry structure somewhere and forgotten, so the garbage collector never reclaims it.
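To make that concrete, here is a stripped-down, self-contained sketch of the pattern I suspect. It is my own illustration (the names newSubOutlet, startHarvester, and openFile are made up for the example), not the actual Beats code: an outlet that owns a goroutine and a couple of channels is created before setup, and is only released if something later calls Close(). If setup fails first and the error path never closes it, the goroutine and its channels stay pinned forever, which would match the runtime.malg and channel.SubOutlet allocations in the profile below.

package main

import "fmt"

// subOutlet mirrors the shape of the leaked object: a couple of channels
// plus a background goroutine that only exits once Close() is called.
type subOutlet struct {
	done chan struct{}
	ch   chan string
	res  chan bool
}

func newSubOutlet() *subOutlet {
	s := &subOutlet{
		done: make(chan struct{}),
		ch:   make(chan string),
		res:  make(chan bool, 1),
	}
	go func() {
		for {
			select {
			case <-s.done:
				return // the only exit path
			case ev := <-s.ch:
				_ = ev
				s.res <- true
			}
		}
	}()
	return s
}

func (s *subOutlet) Close() { close(s.done) }

// startHarvester mimics the suspected error path: the outlet is created
// first, then opening the file fails (e.g. "permission denied"). Because
// the error path never calls Close(), the goroutine and channels leak.
func startHarvester(path string) error {
	out := newSubOutlet()
	if err := openFile(path); err != nil {
		return err // out is never closed -> leaked goroutine + channels
	}
	defer out.Close()
	// ... harvest the file ...
	return nil
}

func openFile(path string) error {
	return fmt.Errorf("open %s: permission denied", path)
}

func main() {
	// Every prospector scan retries the failing file, so the leak grows.
	for i := 0; i < 3; i++ {
		fmt.Println(startHarvester("/home/norman/.bash_history"))
	}
}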

Here's some output from pprof. I would attach the pprof file itself, but the web interface here won't let me upload anything that isn't an image. If there's a way to send you that file, please let me know.

(pprof) top10
33159.58kB of 33159.58kB total (  100%)
Dropped 308 nodes (cum <= 165.80kB)
Showing top 10 nodes out of 51 (cum >= 512.04kB)
      flat  flat%   sum%        cum   cum%
19975.92kB 60.24% 60.24% 19975.92kB 60.24%  runtime.malg
 7680.66kB 23.16% 83.40%  7680.66kB 23.16%  github.com/elastic/beats/filebeat/channel.SubOutlet
 2048.19kB  6.18% 89.58%  2048.19kB  6.18%  github.com/elastic/beats/filebeat/prospector/log.NewHarvester
 1357.91kB  4.10% 93.68%  1357.91kB  4.10%  runtime.allgadd
 1024.08kB  3.09% 96.76%  1024.08kB  3.09%  runtime.acquireSudog
  544.67kB  1.64% 98.41%   544.67kB  1.64%  github.com/elastic/beats/libbeat/publisher/queue/memqueue.NewBroker
  528.17kB  1.59%   100%   528.17kB  1.59%  regexp.(*bitState).reset
         0     0%   100%   528.17kB  1.59%  github.com/elastic/beats/filebeat/beater.(*Filebeat).Run
         0     0%   100%   512.04kB  1.54%  github.com/elastic/beats/filebeat/channel.CloseOnSignal.func1
         0     0%   100%   512.04kB  1.54%  github.com/elastic/beats/filebeat/channel.SubOutlet.func1
(pprof) list SubOutlet
Total: 32.38MB
ROUTINE ======================== github.com/elastic/beats/filebeat/channel.SubOutlet in /home/jeremy/src/go/src/github.com/elastic/beats/filebeat/channel/util.go
    7.50MB     7.50MB (flat, cum) 23.16% of Total
         .          .     15:// SubOutlet create a sub-outlet, which can be closed individually, without closing the
         .          .     16:// underlying outlet.
         .          .     17:func SubOutlet(out Outleter) Outleter {
         .          .     18:	s := &subOutlet{
         .          .     19:		isOpen: atomic.MakeBool(true),
       1MB        1MB     20:		done:   make(chan struct{}),
       2MB        2MB     21:		ch:     make(chan *util.Data),
    4.50MB     4.50MB     22:		res:    make(chan bool, 1),
         .          .     23:	}
         .          .     24:
         .          .     25:	go func() {
         .          .     26:		for event := range s.ch {
         .          .     27:			s.res <- out.OnEvent(event) 

Please let me know if there are more details I can provide, and I'm happy to test any potential solutions when we narrow down the problem.

Thanks,
Jeremy

@jdonahue Thanks so much for all the details and the thorough investigation. There does indeed seem to be something potentially wrong with the SubOutlet. Could you open a GitHub issue and copy-paste this over? https://github.com/elastic/beats I'm hoping you'll be able to attach any kind of file there :crossed_fingers:

Could you also share your config and exact OS in the GitHub issue?

@ruflin You're welcome! I created a GitHub issue and was able to attach the full pprof file:

https://github.com/elastic/beats/issues/6797

Please let me know if there's anything else I can do, and thanks for looking into this.

Thanks, let's continue the discussion there.

@ruflin I see the fix for this issue has been merged, which is awesome! Thanks! Do you know which release version will contain this fix?

It will for sure go into 6.3

Fantastic, thanks again!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.