Hi!
It seems like there's a memory leak in the latest version of Auditbeat (7.5.2) when it runs with the socket module enabled on servers with high network load (such as load balancers).
We recently upgraded from 6.8.0 (which had no such issues) to the newest version (7.5.2) and started to experience heavy memory usage. Not only was memory usage above what we consider normal (around 300-400 MB for a beat), it also kept growing indefinitely until the process went OOM and even crashed some of our servers.
This only happens on our load balancers; memory usage on other types of servers seems to stay within reasonable limits, although that may simply be because they don't have many active sockets.
Some technical information:
Affected version of auditbeat: 7.5.2
Affected servers' operating system: Ubuntu 16.04.5 LTS
Auditbeat config:
auditbeat.modules:

- module: auditd
  # Load audit rules from separate files. Same format as audit.rules(7).
  audit_rule_files: [ '${path.config}/audit.rules.d/*.conf' ]

- module: file_integrity
  paths:
  - /bin
  - /usr/bin
  - /sbin
  - /usr/sbin
  - /etc
  - /root/.ssh/authorized_keys

- module: system
  datasets:
    - host     # General host information, e.g. uptime, IPs
    - login    # User logins, logouts, and system boots.
    - package  # Installed, updated, and removed packages
    - process  # Started and stopped processes
    - socket   # Opened and closed sockets
    - user     # User information

  # How often datasets send state updates with the
  # current state of the system (e.g. all currently
  # running processes, all open sockets).
  state.period: 1h

  # How often auditbeat queries for new processes, sockets etc.
  metrics.period: 3s

  # Enabled by default. Auditbeat will read password fields in
  # /etc/passwd and /etc/shadow and store a hash locally to
  # detect any changes.
  user.detect_password_changes: true

  # File patterns of the login record files.
  login.wtmp_file_pattern: /var/log/wtmp*
  login.btmp_file_pattern: /var/log/btmp*

output.logstash:
  hosts: ***
  ssl.certificate_authorities: ***
  bulk_max_size: 2096
  timeout: 15

setup.template.name: "logstash_auditbeat_template"
setup.template.pattern: "logstash-auditbeat-*"
setup.template.settings:
  index.number_of_shards: 3
  index.refresh_interval: 30s

processors:
- add_host_metadata:
    netinfo.enabled: true
    cache.ttl: 5m
We have the same configuration on all servers where auditbeat is deployed.
Also attaching screenshots of memory usage on a load balancer and on another example server over the last couple of days (the sudden drop in memory usage on the load balancer was caused by restarting the Auditbeat daemon). As you can see, it looks fine on the server that doesn't have a load-balancer role.
@adrisr You were asking for data from other hosts (that are not load balancers).
As you can see in the screenshot, memory looks stable over the last 72 hours.
@nickbabkin unfortunately there is not a lot of information in this profile; it only accounts for 3 MB of allocated objects. I understand you got it via the -memprof argument, which writes the profile when the beat terminates, after most memory has already been freed.
Can you try running Auditbeat with -httpprof :8888 (or any other port number), wait until memory usage is high, and then fetch a profile from the pprof endpoint it exposes?
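Something like the following should work (assuming the default Go pprof endpoints that -httpprof serves and that port 8888 is reachable locally; the output file name is just an example):

    # dump the current heap profile of the running beat to a file
    curl -s http://localhost:8888/debug/pprof/heap -o auditbeat-heap.pprof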
From the logs it's clear that the number of monitored sockets is increasing constantly, going from about 400 to 64k in 5 days. This means that either there really are that many sockets open on your system (some app is creating them and never closing them), or Auditbeat is missing the inet_release events that would let it clean up sockets. That works out to roughly 8-9 lost events of this type per minute (about 63,600 extra sockets over the ~7,200 minutes in 5 days), but I don't see that many lost events in the logs.
Can you compare the number of open sockets in the system vs Auditbeat's state?
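For the system side, a quick way to count sockets could look like this (assuming iproute2's ss is available; /proc/net/sockstat gives a similar per-protocol summary):

    # count all sockets known to the kernel, skipping the header line
    ss -a | tail -n +2 | wc -l
    # or look at the kernel's own per-protocol counters
    cat /proc/net/sockstat

That figure can then be compared against the socket count Auditbeat reports in its logs.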
I can implement some expiration for sockets (as there already is for flows), so that they are cleaned up after some time. But I'd really like to understand what's going on first; it still doesn't look like it's losing the socket close events.
Can you share some logs with the socketdetailed selector? It needs to be enabled separately because it's really, really verbose.
Use either logging.selectors: ["*", "socketdetailed"] in the config or the -d '*,socketdetailed' command-line flag.
It'll log an extra 15-20k lines per second on your system. That will help us see which events Auditbeat is receiving and in which order. Please try to capture at least 5 minutes' worth of logs.
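If it's easier, a one-shot capture from the command line could look roughly like this (the 5-minute timeout, config path, and log file name are just illustrative, and the packaged auditbeat service would need to be stopped first so the two instances don't conflict):

    # run in the foreground with the extra debug selector for ~5 minutes,
    # keeping the stderr log output in a file
    timeout 300 auditbeat -e -d '*,socketdetailed' -c /etc/auditbeat/auditbeat.yml 2> socketdetailed.log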