Auditbeat impacting system performance

Hi, I could use some information on how Auditbeat works so we can try to figure out the limitation that is dragging our system down.

We are just using Auditbeat to monitor one directory and one file (auditbeat.yml). The directory has around 1.83 million files, is 6 terabytes in size and 96% used (yes, we keep purging projects!), and is on a separate mount from the rest of the system. The events are sent to Logstash on a remote server for processing.

It is a virtual system with 96 GB of RAM, 16 GB of swap, dozens of CPUs, and fast disks.

It is not noticeably impacting user performance as far as we can see, but it is impacting the two types of backup that we run.

We have a disaster recovery (DR) system, and every evening rsync copies all the changes from the live server to the DR one. That normally takes around 5-6 hours, but with Auditbeat running it easily doubles.

We are also using TSM for a more 'normal' backup, and it has gone from 13-14 hours to the point where it never completes. If a run takes longer than 24 hours, the next backup does not start; it just finishes the current one. The last run with Auditbeat enabled took 72 hours, and each run was taking longer than the one before. Not good.

What I'm after is some guidance on how auditbeat works internally so that we can focus on seeing what we can do to alleviate this bottleneck.

To me this looks like a disk issue, but I had thought that Auditbeat was lightweight and ran in memory.

Any guidance gratefully received.

(PS. Not a Linux performance expert, so recognise I might need some guidance as to what to check!)

Please share the configuration that you are using, Auditbeat version, and operating system + version.

What requirements are you trying to meet by having Auditbeat monitor these files?

auditbeat-6.2.4-1.x86_64

Red Hat 6

We want to monitor activity that is out of the norm, and certain activity on specific file types.

It is a fairly basic config file, but it was giving us what we wanted.

# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/auditbeat/index.html

#============================  Config Reloading ================================

# Config reloading allows you to dynamically load modules. Each monitored
# file must contain one or more modules as a list.
auditbeat.config.modules:

  # Glob pattern for configuration reloading
  path: ${path.config}/conf.d/*.yml

  # Period on which files under path should be checked for changes
  reload.period: 10s

  # Set to true to enable config reloading
  reload.enabled: true

# Maximum amount of time to randomly delay the start of a metricset. Use 0 to
# disable startup delay.
auditbeat.max_start_delay: 10s

#==========================  Modules configuration =============================
auditbeat.modules:

# The kernel metricset collects events from the audit framework in the Linux
# kernel. You need to specify audit rules for the events that you want to audit.
- module: auditd
  resolve_ids: true
  failure_mode: log
  backlog_limit: 8196
  rate_limit: 0
  include_raw_message: false
  include_warnings: false
  audit_rules: |
    -w /etc/auditbeat/auditbeat.yml -p wa -k auditbeat_issue
    -w /etc/passwd -p wa -k passwd_changes
    -w /PTC/ -p wr -k ptc_code_access

#================================ General ======================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
# If this option is not defined, the hostname is used.
name: ptc-desk

#================================ Outputs ======================================

# Configure what output to use when sending the data collected by the beat.

#----------------------------- Logstash output ---------------------------------
output.logstash:
  hosts: ["soptct62-02.ptc.com:5044"]
  protocol: "https"

#  output.console:
#  pretty: true

#================================= Paths ======================================
# The data path for the auditbeat installation. This is the default base path
# for all the files in which auditbeat needs to store its data. If not set by a
# CLI flag or in the configuration file, the default for the data path is a data
# subdirectory inside the home path.
path.data: /MON/data

# The logs path for an auditbeat installation. This is the default location for
# the Beat's log files. If not set by a CLI flag or in the configuration file,
# the default for the logs path is a logs subdirectory inside the home path.
path.logs: /MON/logs

So the backup is being written to /PTC which is monitored for open syscalls with read or write flags (-p wr)? Do you need both read and write?
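If write auditing turns out to be unnecessary, the narrowed watch would look like this in the audit_rules block (a sketch; same key as the original config, which uses -p wr):

```yaml
audit_rules: |
  # Watch reads only; drop the 'w' flag from the original rule
  -w /PTC/ -p r -k ptc_code_access
```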

Auditbeat can't block the system calls themselves, so it doesn't slow them down directly. The kernel queues the events internally and uses a separate thread ("audit_send_list") to send them over netlink to Auditbeat.

So the slowdown could be caused by Auditbeat using CPU to process the events (starving other processes), or by the kernel-side overhead of auditing every open syscall (I imagine rsync generates a lot of activity).
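For a sense of scale, here is a back-of-envelope sketch; the per-open record count is an assumption, since an audited open() typically emits several kernel records (SYSCALL, CWD, PATH, PROCTITLE) and the exact number varies by kernel and rule:

```shell
# Rough estimate of audit records generated by one full rsync pass
files=1830000        # files under /PTC, from the thread
records_per_open=4   # assumption; varies with kernel version and rule
echo $(( files * records_per_open ))
```

That is on the order of seven million audit records per backup pass, before TSM is counted.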

If you stop Auditbeat after it has loaded the audit rules into the kernel, does the performance change? The kernel will write the audit messages to syslog after Auditbeat is stopped.

What kind of CPU usage are you seeing from kauditd and auditbeat?
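A sketch of how one might check both of those, assuming the audit userspace tools are installed (auditctl needs root; the fallback echo just keeps the sketch runnable elsewhere):

```shell
# Kernel audit status (root only): 'backlog' is the current in-kernel
# queue depth and 'lost' counts records dropped when the queue overflows
auditctl -s 2>/dev/null || echo "auditctl not available here"

# CPU snapshot of the audit pipeline: kauditd is the kernel-side thread,
# auditbeat the userspace consumer (header line kept for readability)
ps -eo comm,%cpu --sort=-%cpu | awk 'NR==1 || /kauditd|auditbeat/'
```

If the slowdown is unchanged with Auditbeat stopped but the rules still loaded, the cost is on the kernel side of the auditing rather than in Auditbeat itself.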

Close. The backup is being taken from /PTC and written to another system (rsync) or being backed up (TSM). So it is /PTC that auditbeat is monitoring.

Interesting point on the wr flags. Not sure; I'll have a good think about whether I need both, and might be able to get away with just read. What we are trying to understand is when something is being copied, specifically out of certain directories within /PTC. I was going to use Logstash to identify those. We'd rather not do the filtering on the server we are monitoring, or we'd set that up for those specific directories; but projects come and go on the system, and we'd probably need to automate auditbeat.yml changes as that happened.

What I was trying to understand is, if Auditbeat is busy, what the main limiting factor would be: memory, CPU, or disk. My understanding was that Auditbeat did not use disk, but are there occasions when, if the system or Auditbeat is busy, it would write out to disk?

The system has 96 GB of RAM (virtual) as well as 16 GB of swap. Not sure of the number of CPUs (again virtual), but that can be increased if needed.

Yes, both rsync and TSM cause a lot of activity (as do a number of other processes), and I use Logstash to filter that out before it reaches Elasticsearch.
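As a sketch of that kind of filter (the field names are assumptions: Auditbeat's auditd module puts the rule key in the event's tags, and "dsmc" is the usual TSM client process name; adjust to the actual event fields):

```
filter {
  # Drop backup-generated noise before it reaches Elasticsearch
  if "ptc_code_access" in [tags] and [process][name] in ["rsync", "dsmc"] {
    drop { }
  }
}
```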

CPU usage is high; Auditbeat is not always at the top of the list, but it is certainly in the top 5 most of the time.

Thanks for getting back to me.

@andrewkroh Hi Andrew, any further thoughts? What I'd like to know is whether Auditbeat runs just in memory, or whether it uses disk at all. If it's in memory, then that is simpler to solve. Thanks, N

Based on your configuration, the only disk usage by Auditbeat would be for logging to /var/log/auditbeat/auditbeat.
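A quick way to confirm that (a sketch; the config earlier in the thread sets path.logs to /MON/logs, so both locations are checked):

```shell
# Size of Auditbeat's log directories, the only disk usage implied by
# the configuration; nonexistent paths are skipped quietly
out=$(du -sh /MON/logs /var/log/auditbeat 2>/dev/null)
echo "${out:-no Auditbeat logs found}"
```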

Thanks, there is nothing in there, so we must be looking at a CPU, memory, or network bottleneck.