I have been running Filebeat in a DaemonSet using the "autodiscover" feature for a couple of weeks, and all worked fine. But at some point on Friday the Filebeat instances started eating memory until they got killed by the OOM killer. I arrived back at work this morning to discover that one of the pods had been restarted 197 times over the weekend, rather than the desired 0.
This is not, sadly, the known issue to do with failing to open files, because there are no log messages about failures to open files.
I tried upgrading from 6.2.4 to 6.3.0, which has a fix for that issue, and the behaviour continued unchanged.
What would people like to see in the way of logs and/or configuration?
The only slightly suspicious-looking things in the logs are DNS lookup timeout messages (of more than one form). I don't know whether I was getting these before the OOM failures started.
I'm sorry to read that you are having problems with Filebeat. For this case it would be helpful to have a memory profile of your Filebeat; you can obtain one with the --memprofile flag or with --httpprof. Could you get one and share it with us?
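For example, run by hand it would look something like the following; the config path and the 0.0.0.0:6060 address are only illustrative, any free port will do:

    # Write a heap profile to a file when Filebeat shuts down
    filebeat -e -c filebeat.yml --memprofile /tmp/filebeat.mprof

    # Or serve the Go pprof HTTP endpoints while Filebeat runs
    filebeat -e -c filebeat.yml --httpprof 0.0.0.0:6060

In a DaemonSet you would add the --httpprof argument to the container's args, binding it to 0.0.0.0 rather than localhost if you want to reach it from outside the pod.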
I've got it running with --httpprof. I've found somewhere that adding /debug/pprof gets you something other than a 404. But which links would you then like me to click on, and what would you like to see?
You can take a live profile from http://<host>:<port>/debug/pprof/heap. It'd be nice if you could send one to us when the memory increases so we can take a look. If you want, you can explore it directly with go tool pprof.
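Something along these lines should do it, with <host>:<port> being whatever address you gave to --httpprof (this is just a sketch; go tool pprof needs a Go toolchain wherever you run it):

    # Explore the live heap profile interactively
    go tool pprof http://<host>:<port>/debug/pprof/heap

    # Or just save the raw profile to a file that can be shared
    curl -s -o filebeat-heap.pprof http://<host>:<port>/debug/pprof/heap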
By the way, what memory limit did you set for these pods? Did you change the Filebeat version last week?
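If you are not sure, something like this should show what is currently applied (assuming the DaemonSet keeps the name and namespace used in the stock manifest, filebeat in kube-system; adjust if yours differ):

    # Print the resource requests/limits from the running DaemonSet
    kubectl -n kube-system get daemonset filebeat \
      -o jsonpath='{.spec.template.spec.containers[0].resources}'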
Profile output will follow shortly. The memory configuration is taken directly from the published Kubernetes deployment file (I just copied it without trying to understand it).
I installed Filebeat 6.2.4 a couple of weeks ago. As far as I know it was initially working fine.
At some point last Friday it started misbehaving, with pods restarted frequently due to being out of memory.
As I knew there was a memory-leak issue that was fixed in 6.3.0, I installed that on Friday to see if it would make any difference, and it appeared not to.
I'm still running 6.3.0, and it still appears to be misbehaving.
I don't really want to try increasing the limit at the moment because there's something really sick with our K8s cluster, and other people think that the Filebeat problems might be the cause, or at least a contributory factor, and they'd rather like me to stop running it for the time being.
We had other reasons to rebuild the K8s cluster. On restarting (and with Metricbeat now running as well), the Filebeat memory usage was flat for a couple of hours and then started increasing again; one pod restarted overnight.