Auditbeat memory leak in 8.13.0

I am running auditbeat v8.13.0, and its memory usage grows very large over time.
Auditbeat is running on a web server that receives relatively few requests.
Below are some screenshots:

Here is the monitoring screenshot:

Here is my audit configuration:

# =========================== Modules configuration ============================

- module: auditd
  resolve_ids: true
  failure_mode: silent
  backlog_limit: 8196
  rate_limit: 0
  include_raw_message: false
  include_warnings: false

  audit_rules: |
    -a always,exit -F arch=b32 -S all -F key=32bit-abi
    -a always,exit -F arch=b64 -S execve,execveat -F auid>=1000 -F auid!=4294967295 -k exec
    -a always,exit -F arch=b64 -S accept,bind,connect -F auid>=1000 -F auid!=4294967295 -k external-access
    -a always,exit -F arch=b64 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EACCES -k file-access
    -a always,exit -F arch=b64 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EPERM -k file-access

- module: file_integrity
  paths:
  - /bin
  - /usr/bin
  - /usr/local/bin
  - /snap/bin
  - /sbin
  - /usr/sbin
  - /usr/local/sbin
  - /usr/games
  - /usr/local/games
  - /etc
  - /root
  - /home

  exclude_files:
  - '(?i)\.sw[nop]$'
  - '~$'
  - '/\.git($|/)'
  - '.bash_history'
  - '.viminfo.tmp'
  - '.viminfo'
  - '.virtualenvs'

  backend: fsnotify
  scan_at_start: true
  scan_rate_per_sec: 50 MiB
  max_file_size: 100 MiB
  hash_types: [sha1]
  recursive: true

- module: system
  datasets:
    - package # Installed, updated, and removed packages

  period: 2m # The frequency at which the datasets check for changes

- module: system
  datasets:
    - host    # General host information, e.g. uptime, IPs
    - login   # User logins, logouts, and system boots.
    - process # Started and stopped processes
    - socket  # Opened and closed sockets
    - user    # User information
  period: 10s
  state.period: 12h
  process.hash.scan_rate_per_sec: 50 MiB
  process.hash.max_file_size: 100 MiB
  process.hash.hash_types: [sha1]
  user.detect_password_changes: true
  login.wtmp_file_pattern: /var/log/wtmp*
  login.btmp_file_pattern: /var/log/btmp*

# ================================== General ===================================
timestamp.precision: microsecond

# ================================= Processors =================================

processors:
  - drop_event:
      when:
        and:
          - contains:
              tags: external-access
          - equals:
              event.action: connected-to
          - or:
            - has_fields: ['destination.path']
            - network:
                destination.ip: ['', '::1/128', '']

  - drop_event:
      when:
        and:
          - equals:
              event.dataset: socket
          - network:
              source.ip: ['', '::1/128', '']
          - network:
              destination.ip: ['', '::1/128', '', '']

Thanks for the report @nemo

Do you think you could provide some additional information?

Were you running a previous version of auditbeat before? If yes, which version?
How many running processes are on the machine during that high memory load?
What webserver are you running?
Can you provide any info on how many and what rate connections are being made to the webserver?


This is the first time I have used auditbeat.
There are several uWSGI and Celery processes running on the server, and the connection rate is quite low.
Below is the network info:

I made one change: I set "socket.dns.enabled: false". The memory growth rate is now slower than before, but memory is still growing...

This is the latest pprof:

I have checked the source code, and I suspect the cause is that the "struct process" is not being properly released, and that the copying of the member "resolvedDomains" has exacerbated the memory growth.
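To make the suspicion concrete, here is a minimal Go sketch (all names are illustrative, not the actual beats types): if every flow or socket takes its own copy of a process's resolved-domains map, memory scales with copies times entries rather than with entries alone.

```go
package main

import "fmt"

// Hypothetical sketch, not the actual beats code: a process entry that
// carries its own map of resolved DNS names. If every new flow copies
// this map, memory grows with (flows x domains) instead of just (domains).
type process struct {
	pid             int
	resolvedDomains map[string]string // address -> domain
}

// copyDomains mimics the per-flow copy suspected of exacerbating growth.
func copyDomains(src map[string]string) map[string]string {
	dst := make(map[string]string, len(src))
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

func main() {
	p := &process{pid: 1234, resolvedDomains: map[string]string{
		"10.0.0.5": "example.internal",
	}}
	// Each flow gets an independent copy; none of them share storage.
	flows := make([]map[string]string, 0, 3)
	for i := 0; i < 3; i++ {
		flows = append(flows, copyDomains(p.resolvedDomains))
	}
	fmt.Println(len(flows), len(flows[0])) // 3 1
}
```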


hello @nemo :wave: Could you please expand a bit more on your findings regarding the following:

caused by the "struct process" not being properly released

I had a look and I can see that the entry for the process struct pointer in the process map is deleted at process termination. However, I also spotted some potential leaks: this pointer is passed along to other places, namely the socket struct and the flow struct, where we need to set it to nil when we are done with them. Are your findings similar, or did you spot something else?
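The retained-pointer situation can be sketched as follows (hypothetical types, not the actual beats code): deleting the map entry alone is not enough while a socket still holds the pointer, which is why nilling the reference when done is the candidate fix.

```go
package main

import "fmt"

// Illustrative types only; the real beats state is more involved.
type process struct{ pid int }

type socket struct{ proc *process }

type state struct {
	processes map[int]*process
	sockets   []*socket
}

// terminate removes the process from the map AND drops any references
// held by sockets. Without the second step, the process struct stays
// reachable for the GC and the memory is never reclaimed.
func (s *state) terminate(pid int) {
	delete(s.processes, pid)
	for _, sk := range s.sockets {
		if sk.proc != nil && sk.proc.pid == pid {
			sk.proc = nil // the candidate fix: nil the retained pointer
		}
	}
}

func main() {
	p := &process{pid: 42}
	st := &state{
		processes: map[int]*process{42: p},
		sockets:   []*socket{{proc: p}},
	}
	st.terminate(42)
	fmt.Println(len(st.processes), st.sockets[0].proc == nil) // 0 true
}
```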

the copying of the member "resolvedDomains" has exacerbated memory growth

Because of this copying, I would expect big steps in memory consumption, as shown in your graphs. However, given the workload you described, memory consumption should reach a plateau after a few increases, which doesn't happen in your case.

caused by the "struct process" not being properly released

Just a suspicion... I haven't used Go before, so I can't debug it.

I tried recompiling auditbeat with resolvedDomains replaced by a global LRU cache, and set "socket.dns.enabled: true". Below is the memory stat:

This operation has had some effect, but it still hasn't solved the problem...
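For reference, the global-LRU idea can be sketched like this (an illustrative cache, not the code actually compiled): one bounded, shared address-to-domain cache replaces the per-process map copies, so memory is capped regardless of process count.

```go
package main

import (
	"container/list"
	"fmt"
)

// Minimal LRU sketch: a bounded address->domain cache shared globally,
// instead of a resolvedDomains map copied per process. Illustrative only.
type lru struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding [2]string{key, value}
}

func newLRU(capacity int) *lru {
	return &lru{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lru) Put(key, val string) {
	if el, ok := c.items[key]; ok {
		el.Value = [2]string{key, val}
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap { // evict the least recently used entry
		old := c.order.Back()
		c.order.Remove(old)
		delete(c.items, old.Value.([2]string)[0])
	}
	c.items[key] = c.order.PushFront([2]string{key, val})
}

func (c *lru) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.([2]string)[1], true
}

func main() {
	cache := newLRU(2)
	cache.Put("10.0.0.1", "a.internal")
	cache.Put("10.0.0.2", "b.internal")
	cache.Put("10.0.0.3", "c.internal") // evicts 10.0.0.1
	_, ok1 := cache.Get("10.0.0.1")
	v, ok3 := cache.Get("10.0.0.3")
	fmt.Println(ok1, ok3, v) // false true c.internal
}
```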

The auditbeat has been running for over 24 hours. This is the latest heap image...


thanks for the reply @nemo. Since you can compile auditbeat and run it under the scenario that exhibits this behaviour, would it be possible to compile it from a branch other than main that contains some potential fixes for this issue? Having this option is going to be tremendously helpful, as we can directly test my potential fixes in your use case.

No problem. Provide the branch, please.

amazing! here is the branch @nemo: GitHub - pkoutsovasilis/beats at pkoutsovasilis/auditbeat_socket_memory. Try the same experiments you did before, and let's see whether nulling the process pointers from the socket and flow structs solves the issue. Ideally we would like to test this with an experiment similar to the first graph you showed, the one with the GB memory footprint.

I ran auditbeat compiled from that branch, and the memory has been stable for two days. Below is the memory monitor.

great news @nemo, ty for testing this out :slightly_smiling_face:. To me the memory footprint graph above makes perfect sense, and I consider the leak fixed; do you agree? The next step is to create an issue on the beats repo. I can create it, but since you reported this issue here and helped test it, I figured maybe you want to do the honours? :smile:

@Panos_Koutsovasilis Sorry, but the memory is still growing slowly... Perhaps this is Go's GC mechanism?

looking at the graph above, I can tell there was some traffic at ~ Apr 28 16:00 which caused a memory bump; however, the fact that memory afterwards reached another plateau makes me think the leak is not present. Given that there appears to be no memory pressure on the system (judging from the graph), it is common for the Go GC to make these bumps in allocated memory and not give it back to the operating system, so I think that is what we observe here.

PS: I assume this long-running experiment ran under the same configuration as the earlier graph that reached a memory footprint higher than 4.0 GB.

Latest memory monitoring graph..

The memory went up to 500 MB, which shouldn't be normal.
The memory suddenly dropped because auditbeat crashed. The auditbeat build compiled from the branch has crashed twice. The last log line before the crash:
{"log.level":"info","@timestamp":"2024-04-29T11:24:49.748Z","log.logger":"socket","log.origin":{"function":"runtime.gopanic","file.name":"runtime/panic.go","file.line":770},"message":"Stopping DNS capture.","service.name":"auditbeat","ecs.version":"1.6.0"}

thanks for the info @nemo, that's really helpful! I will have another look for memory leaks, but first I need to work out the crashing, which is evidently caused by my pointer nilling. Could you attach more log lines from the end? At least 50 should do it, because the last line you posted is already in the panic handler and I can't trace its caller.

The previous logs were all similar to the following metrics report, which is output every 30 seconds and has little value for the crash issue.

{"log.level":"info","@timestamp":"2024-04-29T11:24:35.852Z","log.logger":"monitoring","log.origin":{"function":"(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"auditbeat","monitoring":{"metrics":{"auditd":{"received_msgs":452},"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":4481884160}}}},"cpu":{"system":{"ticks":1618520,"time":{"ms":140}},"total":{"ticks":4936940,"time":{"ms":960},"value":4936940},"user":{"ticks":3318420,"time":{"ms":820}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":220},"info":{"ephemeral_id":"d1d1d6c5-1049-4510-921d-6607c560b8a9","uptime":{"ms":275793950},"version":"8.15.0"},"memstats":{"gc_next":465282640,"memory_alloc":239535328,"memory_total":265877268896,"rss":456282112},"runtime":{"goroutines":82}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":68,"active":0,"batches":2,"total":68},"read":{"bytes":592,"errors":3},"write":{"bytes":16070,"latency":{"histogram":{"count":22071,"max":194,"mean":24.5927734375,"median":18,"min":7,"p75":33,"p95":54,"p99":88.5,"p999":193.1250000000008,"stddev":17.801948259651088}}}},"pipeline":{"clients":8,"events":{"active":72,"filtered":231,"published":130,"total":361},"queue":{"acked":68}}},"metricbeat":{"auditd":{"auditd":{"events":113,"success":113}},"system":{"socket":{"events":248,"success":248}}},"system":{"load":{"1":0.11,"15":0.44,"5":0.31,"norm":{"1":0.0138,"15":0.055,"5":0.0388}}}},"ecs.version":"1.6.0"}}