[Auditbeat] Memory leak in 7.5.2

Hi!
It seems like there's a memory leak in the latest version of Auditbeat (7.5.2) when it runs with the socket module enabled on servers under high network load (such as load balancers).

We recently upgraded from 6.8.0 (which had no similar issues) to the newest version (7.5.2) and started to experience heavy memory usage. Not only was memory usage above what we consider normal (around 300-400 MB for a beat), it also kept growing indefinitely until the process went OOM and even crashed some of our servers.

This only affects our load balancers; memory usage on other types of servers seems to stay within reasonable limits, although that might simply be because they don't have many active sockets.

Some technical information:
Affected version of auditbeat: 7.5.2
Affected servers operating system: Ubuntu 16.04.5 LTS

Auditbeat config:

auditbeat.modules:

- module: auditd
  # Load audit rules from separate files. Same format as audit.rules(7).
  audit_rule_files: [ '${path.config}/audit.rules.d/*.conf' ]

- module: file_integrity
  paths:
  - /bin
  - /usr/bin
  - /sbin
  - /usr/sbin
  - /etc
  - /root/.ssh/authorized_keys

- module: system
  datasets:
  - host     # General host information, e.g. uptime, IPs
  - login    # User logins, logouts, and system boots.
  - package  # Installed, updated, and removed packages
  - process  # Started and stopped processes
  - socket   # Opened and closed sockets
  - user     # User information

  # How often datasets send state updates with the
  # current state of the system (e.g. all currently
  # running processes, all open sockets).
  state.period: 1h

  # How often auditbeat queries for new processes, sockets etc
  metrics.period: 3s

  # Enabled by default. Auditbeat will read password fields in
  # /etc/passwd and /etc/shadow and store a hash locally to
  # detect any changes.
  user.detect_password_changes: true

  # File patterns of the login record files.
  login.wtmp_file_pattern: /var/log/wtmp*
  login.btmp_file_pattern: /var/log/btmp*

output.logstash:
  hosts: ***
  ssl.certificate_authorities: ***

  bulk_max_size: 2096
  timeout: 15

setup.template.name: "logstash_auditbeat_template"
setup.template.pattern: "logstash-auditbeat-*"
setup.template.settings:
  index.number_of_shards: 3
  index.refresh_interval: 30s

processors:
- add_host_metadata:
    netinfo.enabled: true
    cache.ttl: 5m

We have the same configuration on all servers where auditbeat is deployed.

Also attaching screenshots of memory usage on a load balancer and on another example server over the last couple of days (the sudden drop in memory usage on the load balancer was due to restarting the auditbeat daemon). As you can see, it looks pretty okay on the server that doesn't have the load balancer role.

Attaching the latest profiler output from one of the affected servers. The socket module is clearly at the top in memory usage, and it keeps growing.

go tool pprof http://localhost:8888/debug/pprof/heap
Fetching profile over HTTP from http://localhost:8888/debug/pprof/heap
Saved profile in /home/user/pprof/pprof.auditbeat.alloc_objects.alloc_space.inuse_objects.inuse_space.021.pb.gz
File: auditbeat
Build ID: b9be0e911c5db76838425b6aa234b98f3defe260
Type: inuse_space
Time: Feb 7, 2020 at 12:38pm (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 87.94MB, 100% of 87.94MB total
Showing top 10 nodes out of 68
      flat  flat%   sum%        cum   cum%
   83.25MB 94.66% 94.66%    83.25MB 94.66%  github.com/elastic/beats/x-pack/auditbeat/module/system/socket.(*state).createFlow
       1MB  1.14% 95.80%        1MB  1.14%  bytes.makeSlice
       1MB  1.14% 96.94%        1MB  1.14%  crypto/x509.parseCertificate
    0.60MB  0.68% 97.62%     0.60MB  0.68%  github.com/elastic/beats/auditbeat/module/auditd.(*MetricSet).receiveEvents
    0.58MB  0.66% 98.28%     0.58MB  0.66%  github.com/elastic/beats/x-pack/auditbeat/module/system/socket.(*dnsTracker).AddTransaction
    0.51MB  0.58% 98.86%     0.51MB  0.58%  regexp/syntax.(*compiler).inst
    0.50MB  0.57% 99.43%     0.50MB  0.57%  github.com/elastic/beats/vendor/github.com/json-iterator/go.init.4
    0.50MB  0.57%   100%     0.50MB  0.57%  github.com/elastic/beats/libbeat/outputs/kafka.init.ializers
         0     0%   100%        1MB  1.14%  bytes.(*Buffer).Write
         0     0%   100%        1MB  1.14%  bytes.(*Buffer).grow
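
In case it helps, the call paths into createFlow can be drilled into further from the same interactive session (these are standard pprof commands; omitting the output here):

(pprof) top -cum
(pprof) peek createFlow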

Hi, can you share a profile created with --memprofile file and logs with -d socket (but not -d '*')?
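
For example, something along these lines (the config and output paths below are just placeholders):

auditbeat -e -c /etc/auditbeat/auditbeat.yml -d socket --memprofile /tmp/auditbeat.mem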

I'm curious about the lines containing state flows=NNN [...] in particular. Seems that flows are created but not expired for some reason.

Also, do you see it grow on the other systems as well? Maybe just slowly, so it doesn't become a problem as soon.

@adrisr,

Attaching the new profile as requested (couldn't find a way to attach files here, so I uploaded it via Firefox Send):
link

Please let me know if any other information is required.

Meanwhile, memory usage on the host has reached 660 MB out of 2 GB total and keeps growing. There's definitely a leak.

Not quite; the others are stable, but they also have a significantly lower number of sockets.

@adrisr You were asking for data from other hosts (that are not load balancers).
As you can see in the screenshot, memory looks stable over the last 72 hours.

Shall we create a GitHub issue for a confirmed bug, or can you not confirm it yet?

@nickbabkin unfortunately there is not a lot of information in this profile. It only accounts for 3 MB of allocated objects. I understand you got it via the --memprofile argument, which would have written the profile when the beat terminated and most memory had already been freed.

Can you try running Auditbeat with -httpprof :8888 (or any other port number), then wait until memory usage is high and fetch a profile using:

curl 'http://localhost:8888/debug/pprof/heap?debug=1&gc' -o mem_profile.txt
curl 'http://localhost:8888/debug/pprof/goroutine?debug=1' -o goroutines.txt
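
If it's easier, the raw heap profile can also be saved for later inspection with the standard Go tooling (file names below are just examples):

curl -o heap.pb.gz 'http://localhost:8888/debug/pprof/heap'
go tool pprof auditbeat heap.pb.gz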

Also, did you have a chance to grep Auditbeat logs for the lines containing state flows= ?

Hi @adrisr,

Here are the latest memprofile and goroutines after we ran the beat for 3 days:

As for the sockets file, is there any way to send it to you privately? It contains some sensitive data, and we have an NDA with Elastic.

Thank you.

Sure, just send me a private message here or an email to adrian at elastic co

Sockets file is here:

Password is sent to your email. Thanks!

From the logs it's clear that the number of monitored sockets is constantly increasing, going from about 400 to 64k in 5 days. This means that either there really are that many sockets open in your system (some app is creating them and never closing them), or Auditbeat is missing the inet_release events that cause it to clean up sockets, roughly 8 such events lost per minute, but I don't see that many lost events in the logs.

Can you compare the number of open sockets in the system vs Auditbeat's state?

$ grep 'state flows=' logs/auditbeat | tail -1 | grep -o 'sockets=[0-9]*'

vs

$ sudo ss -pntu | wc -l

@adrisr,

I think we now have a clear indication that something is very wrong in the code:

live2:root@balancer:~# grep 'state flows=' /var/log/syslog* | tail -1 | grep -o 'sockets=[0-9]*'
sockets=64275

live2:root@balancer:~# ss -pntu | wc -l
553

Thanks @nickbabkin,

I can implement some expiration for sockets (as there already is for flows) so that they are cleaned up after some time. But I'd really like to understand first what's going on; it still doesn't look like it's losing the socket close events.
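
Roughly, the idea would be something along these lines (a minimal sketch with illustrative names, not the actual socket dataset code):

package main

import (
    "fmt"
    "time"
)

// socket holds the per-socket state; lastSeen would be refreshed on
// every event observed for that socket.
type socket struct {
    lastSeen time.Time
}

// state tracks all known sockets, keyed here by an opaque identifier.
type state struct {
    socks        map[uintptr]*socket
    maxSocketAge time.Duration
}

// expireSockets drops sockets that have been idle for longer than
// maxSocketAge, so a missed close event can no longer keep them in
// memory forever.
func (s *state) expireSockets(now time.Time) {
    for key, sk := range s.socks {
        if now.Sub(sk.lastSeen) > s.maxSocketAge {
            delete(s.socks, key)
        }
    }
}

func main() {
    s := &state{
        socks:        map[uintptr]*socket{1: {lastSeen: time.Now().Add(-time.Hour)}},
        maxSocketAge: 30 * time.Minute,
    }
    s.expireSockets(time.Now())
    fmt.Println("sockets left:", len(s.socks)) // prints 0: the idle socket was expired
}

A real implementation would need an expiration long enough not to drop long-lived idle connections, which is another reason I'd like to understand the root cause first.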

Can you share some logs with the socketdetailed selector? It needs to be enabled separately because it's really verbose.

Either logging.selectors: ["*", "socketdetailed"] or the -d '*,socketdetailed' command-line flag.

It'll log an extra 15-20k lines per second on your system. That will help us see which events Auditbeat is receiving and in which order. Please try to capture at least 5 minutes' worth of logs.

Thanks again for your collaboration.

Hi @adrisr,

Here's the information that you requested (beware, it's more than 1 GB when unarchived):

Password is the same as for the previous file, you can find it in the email.

Thanks and looking forward to the fix.

Hi @adrisr,

Any updates here? Were you able to find the reason for the leak?

Thank you!

Hi @nickbabkin, I will prepare a fix as soon as I find some time. Will you be able to test it?

Hi @adrisr,

Yes, we'll try to help with testing.

We've prepared a patched version of 7.6.2 for testing. Can you give it a try?

Here are the packages: https://drive.google.com/drive/folders/1r7nH1v7Oc3e2BRsJZ6Wuuz69gfa2DLwh?usp=sharing

And here's the PR with the changes: https://github.com/elastic/beats/pull/17500

Hi @adrisr,

I gave it to our DevOps guys and will share the details as soon as I have them.

Thanks for the fix.